Preface



The 11th international conference on Inductive Logic Programming, ILP2001, 
was held in Strasbourg, France, September 9-11, 2001. ILP2001 was co-located 
with the 3rd international workshop on Logic, Learning, and Language (LLL 2001), 
and nearly co-located with the joint 12th European Conference on Machine 
Learning (ECML2001) and 5th European conference on Principles and Practice 
of Knowledge Discovery in Databases (PKDD2001). 

Continuing a series of international conferences devoted to Inductive Logic 
Programming and Relational Learning, ILP2001 is the central annual event for 
researchers interested in learning structured knowledge from structured examples 
and background knowledge. 

One recent major challenge for ILP has been to contribute to the exponential
emergence of Data Mining, and to address the handling of multi-relational
databases. On the one hand, ILP has developed a body of theoretical results and 
algorithmic strategies for exploring relational data, essentially but not exclusively 
from a supervised learning viewpoint. These results are directly relevant to an 
efficient exploration of multi-relational databases. 

On the other hand, Data Mining might require specific relational strategies to
be developed, especially with regard to the scalability issue. The near-colocation
of ILP 2001 with ECML 2001-PKDD 2001 was an incentive to increase
cross-fertilization between the ILP relational savoir-faire and the new problems and
learning goals addressed and to be addressed in Data Mining.

Thirty-seven papers were submitted to ILP, among which twenty-one were 
selected and appear in these proceedings. Several - non-disjoint - trends can be 
observed, along an admittedly subjective clustering. 

On the theoretical side, a new mode of inference is proposed by K. Inoue,
analogous to the open-ended mode of Bayesian reasoning (where the frontier
between induction and abduction wanes). New learning refinement operators are
proposed by L. Badea, while R. Otero investigates negation-handling settings.
Rule stretching (M. Eineborg and H. Boström) can also be considered a new
inductive-deductive operator.

Several hybrid frameworks are proposed, either bridging the gap between ILP 
and other learning paradigms, e.g. Bayesian inference (K. Kersting and L. De 
Raedt), Neural Nets (R. Basilio, G. Zaverucha, and V. C. Barbosa) or Feature 
Selection (T. Ozaki and K. Furukawa) - or exploiting other search paradigms,
e.g. Constraint Satisfaction (J. Maloberti) or Genetic Algorithms (A. Braud and
C. Vrain), to address particular ILP tasks.

Among the tasks addressed, changes of representation are taking on increasing
importance, ranging from propositionalization (A. Braud and C. Vrain, already
mentioned) and construction of structural features (S. Kramer), to
aggregation-based transformations (M.-A. Krogel and S. Wrobel).

Committee-based and statistical machine learning interestingly pervade ILP 
through boosting-based approaches (S. Hoche and S. Wrobel), positive-only 
learning (F. Železný), and transductive inference (M. Eineborg and H. Boström,
already mentioned). Last but not least, an efficient relational cross-validation 
procedure is proposed by J. Struyf and H. Blockeel. 

The application papers deserve a special mention as they demonstrate when 
and how relational representations can make the difference. Language-related 
applications range from Natural Language (M. Nepil) to XML documents (A. 
Yamamoto, K. Ito, A. Ishino, and H. Arimura), and shell logs (N. Jacobs and 
H. Blockeel). Bio-informatics offers many challenging relational problems (A. 
Karwath and R. D. King), in the spirit of the founding ILP application, i.e. the 
mutagenesis problem^. Other applications are concerned with medical control 
(R. Quiniou, M.-O. Cordier, G. Carrault, and F. Wang) and spatial data mining 
(D. Malerba and F. A. Lisi). 

The two invited talks, one held jointly with LLL and given by D. Roth (Univ.
of Illinois, USA), the other given by H. T. T. Toivonen (Nokia, Finland), described
the challenges in two of the hottest fields for Machine Learning and ILP: Natural
Language, and the Genome^.

We wish to thank all researchers who submitted their papers to ILP 2001,
all external referees whose kind help was very welcome, and the members of
the Program Committee for their commitment to making ILP 2001 an open and
lively venue of high scientific quality.



July 2001 



Céline Rouveirol and Michèle Sebag
Program Chairs



^ R.D. King, A. Srinivasan, and M.J.E. Sternberg, Relating chemical activity to struc- 
ture: an examination of ILP successes, New Gen. Comput., 13, 1995. 

^ Available at: http://www.lri.fr/ ilp2001/ 
A Refinement Operator for Theories

Liviu Badea



AI Lab, National Institute for Research and Development in Informatics 
8-10 Averescu Blvd., Bucharest, Romania. 
Abstract. Most implemented ILP systems construct hypotheses clause
by clause using a refinement operator for clauses. To avoid the problems
faced by such greedy covering algorithms, more flexible refinement operators
for theories are needed. In this paper we construct a syntactically
monotonic, finite and solution-complete refinement operator for theories,
which eliminates certain annoying redundancies (due to clause deletions),
while also addressing the limitations faced by HYPER’s refinement operator
(which are mainly due to keeping the number of clauses constant
during refinement).

We also show how to eliminate the redundancies due to the commutativity
of refinement operations while preserving weak completeness as
well as a limited form of flexibility. The refinement operator presented
in this paper represents a first step towards constructing more efficient
and flexible ILP systems with precise theoretical guarantees.



1 Introduction and Motivation 

Although the research in Inductive Logic Programming (ILP) has concentrated
on both implementations (e.g. [8,7]) and theoretical results [4] (such as correctness,
completeness and complexity), there is still a significant gap between these
aspects, mainly due to a poor understanding of the combinatorial aspects of 
the search for solutions. In this respect, correctness and completeness results are 
necessary but not sufficient for obtaining an efficient learner. On the other hand, 
ad-hoc search heuristics might prove effective in certain cases, but the lack of 
theoretical guarantees limits their applicability and the interpretation of their 
results. 

With only a few notable exceptions (such as HYPER [3] or MPL [5]), most
implemented ILP systems construct hypotheses clause by clause by employing 
a refinement operator for clauses. As shown by Bratko [3], such greedy covering 
algorithms face several problems such as: unnecessarily long hypotheses (with 
too many clauses), difficulties in handling recursion and difficulties in learning 
multiple predicates simultaneously. These problems are due to the fact that a 
good hypothesis is not necessarily assembled from locally optimal clauses. In fact, 
locally inferior clauses may reveal their (global) superiority only as a whole. And 
it is exactly this case (of mutually interacting clauses) that most implemented 
ILP systems do not deal with well. 



C. Rouveirol and M. Sebag (Eds.): ILP 2001, LNAI 2157, pp. 1-14, 2001.
© Springer-Verlag Berlin Heidelberg 2001
A solution to this problem would be to construct hypotheses as a whole
(rather than on a clause by clause basis) by using a refinement operator for 
entire theories. But unfortunately, the combinatorial complexity of a refinement 
operator for clauses is high enough already, making a naive refinement operator 
for theories useless for all practical purposes. 

The main problem encountered when constructing a refinement operator for 
theories is its redundancy, which heavily multiplies the size of an already huge 
search space. Sometimes, a good search heuristic can compensate for the size of 
such search spaces. However, very often, the failure in coping with the required 
search is attributed solely to the weakness (or maybe myopia) of the heuristic
employed. In [2] we have argued that among the responsible factors for such failures
one should also count the lack of flexibility of the refinement operator, its
redundancy, as well as its incompleteness. While completeness and non-redundancy
are desiderata that have been achieved in state-of-the-art systems like Progol
[7], flexibility has hardly been studied or even defined in a precise manner. (A
precise definition of flexibility of refinement operators for clauses was given in
[2].)

Flexibility becomes an issue especially in the case of (weakly) complete and 
non-redundant refinement operators, because redundancy is usually avoided by 
imposing a strict discipline on refinement operations, which usually relies on a 
predetermined (static) ordering of the literals, variables and even clauses. The 
resulting lack of flexibility can unfortunately disallow certain refinements, even 
in cases in which the search heuristic recommends their immediate exploration.
(These hypotheses will be explored eventually, but maybe with an exponential
time delay.) The solution to this problem, proposed in [2], consists in enhancing
the flexibility of the clausal refinement operator by using a dynamic literal
ordering, constructed at search time. In this paper, we show how to construct a 
flexible refinement operator for theories. 

Combining (weak) completeness and non-redundancy with flexibility has
been studied in [2], but only for clausal refinement operators. Although maximal
flexibility can only be achieved at the expense of intractability and exponential
storage space, a limited form of flexibility can be achieved without significant
additional costs, while preserving the completeness and non-redundancy of the
refinement operator. This hints at a very general trade-off between (weak)
completeness, non-redundancy, flexibility and tractability^.

^ If we insist on (weak) completeness and non-redundancy, there is a fundamental
trade-off between flexibility and tractability. For achieving non-redundancy, we have
to store somehow a representation of the visited hypotheses space, so that every time
a refinement of some H2 to some H' is considered, we can check that H' hasn’t been
visited before. For tractability (of these checks), we cannot store a very fine grained
representation of the visited space, so whenever visiting a hypothesis H we will store
a coarse grained representation H̄ of H. However, this will block (in the future) not
only the refinements leading to H, but also all those H' ≠ H that are indiscernible
w.r.t. the coarse graining. This diminishes the flexibility of the refinement operator.

A natural question is “why don’t we use the partial refinement tree as a sort
of index structure for the visited hypotheses space?” Although the depth of the
At the level of clauses, FOIL [8], for example, gives up completeness for
maximum flexibility. But if the heuristic fails to guide the search to a solution, 
the system cannot rely on a complete refinement operator to explore alternative 
paths. On the other hand, Progol insists on completeness and non-redundancy 
at the expense of flexibility: some refinement steps are never considered because 
of the static discipline for eliminating redundancies. Finally, systems based on 
ideal refinement operators are complete and can be maximally flexible, but they 
are highly redundant. 

At the level of theories, both FOIL and Progol construct clauses one by one 
(Progol does this for ensuring non-redundancy and weak completeness). MPL 
[5] has more flexibility, but is incomplete. HYPER is less incomplete and still
performs surprisingly well, even when the search heuristic is not perfect. This is
mainly due to its avoiding an overly complex refinement operator such as^

$\rho_T(T) = \{ (T \setminus \{C\}) \cup \rho(C) \mid C \in T \}$

by keeping the number of clauses constant during refinement:

$\rho_H(T) = \{ (T \setminus \{C\}) \cup \{C'\} \mid C \in T,\ C' \in \rho(C) \}$.    (1)

Therefore, HYPER has to start with theories containing multiple copies of certain
very general clauses (corresponding to the predicates to be learned). Thus,
the main reason for HYPER’s success seems to be its avoidance of certain
redundancies and combinatorial explosions by keeping the number of clauses constant.
But there are still other redundancies, such as:

— redundancies due to the commutativity of the refinement operations 

— redundant clauses within a theory (which are not removed in order to give 
them later the chance to be specialized). 

Keeping a constant number of clauses in theories during refinement is especially
problematic when the number of clauses in the target theory cannot be easily
estimated. If this number is significant, HYPER will rediscover fragments of
the target theory over and over again without being able to reuse a k-clause
solution fragment in a larger n-clause theory (n > k)^. This also significantly
increases the search time. Even worse, when learning theories for n predicates

refinement tree is typically logarithmic in the number of visited hypotheses, searching 
for a given hypothesis, for example a clause with literals $L_1 L_2 \ldots L_n$, involves in
general searching along n! paths (corresponding to all permutations of $L_1 L_2 \ldots L_n$).
Of course, at most one such path will actually lead to our hypothesis (since the 
refinement tree belongs to a non-redundant operator), but the search along n! paths 
at each refinement step cannot be avoided and is intractable in practice. 

^ Theories T are viewed as sets of clauses C. $\rho$ is a complete refinement operator for
clauses.

^ In HYPER we also have redundancies between theories with different numbers of
clauses, for example between $T_1 = T$ and $T_2 = T \wedge T_{rest}$ for a (very specific) $T_{rest}$
such that $\forall C_2 \in T_{rest},\ \exists C_1 \in T$ with $C_1 \succeq C_2$. (This ensures that $T_1 \sim T_2$.)
$p_1, p_2, \ldots, p_n$ (while allowing at most N clauses for each of them), HYPER will
have to consider start theories.

In this paper we present a refinement operator for theories that solves most 
of the above-mentioned problems: 

— it is complete and flexible (i.e. allows interleaving the refinement of clauses) 

— it can exploit a good search heuristic by avoiding the pitfalls of a greedy 
covering algorithm 

— it doesn’t keep the number of clauses in a theory constant: it introduces new
clauses exactly when these are needed

— it never deletes clauses (unlike MPL for example, where deleted clauses have
to be marked to avoid adding them again later).

2 Refinement Operators for Theories 

Refinement operators decouple the search heuristic from the search algorithm. 
Instead of the usual refinement operators for clauses, we will construct refine- 
ment operators for entire theories. For a top-down search, we deal with down- 
ward refinement operators, i.e. ones that construct theory specialisations. More 
precisely, we will consider refinement operators w.r.t. the subsumption ordering 
between theories. 

In the following, we will regard clauses as sets of literals (connected by dis- 
junction) and theories as sets of clauses (connected by conjunction). Clauses will 
be denoted by C, while theories by T (possibly with super/sub-scripts). 

Definition 1. Clause $C_1$ subsumes clause $C_2$, $C_1 \succeq C_2$, iff there exists a
substitution $\theta$ such that $C_1\theta \subseteq C_2$ (the clauses being viewed as sets of literals).

Theory $T_1$ subsumes theory $T_2$, $T_1 \succeq T_2$, iff $\forall C_2 \in T_2.\ \exists C_1 \in T_1$ such that
$C_1 \succeq C_2$.

A hypothesis H (either a clause or a theory) properly subsumes H', $H \succ H'$,
iff $H \succeq H'$ and $H' \not\succeq H$.

H and H' are subsume-equivalent, $H \sim H'$, iff $H \succeq H'$ and $H' \succeq H$.
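The subsumption relations of Definition 1 can be illustrated with a minimal propositional sketch. It assumes clauses without variables, so the substitution is empty and clause subsumption reduces to set inclusion; the function names and the literal encoding (strings such as "~a" for a negated atom) are illustrative assumptions, not part of the paper.

```python
# Propositional sketch: a clause is a frozenset of literals, a theory is a set of clauses.
# Without variables the substitution theta is empty, so C1 subsumes C2 iff C1 is a subset of C2.

def clause_subsumes(c1, c2):
    """C1 subsumes C2 (propositional case): every literal of C1 occurs in C2."""
    return c1 <= c2

def theory_subsumes(t1, t2):
    """T1 subsumes T2: every clause of T2 is subsumed by some clause of T1."""
    return all(any(clause_subsumes(c1, c2) for c1 in t1) for c2 in t2)

def properly_subsumes(t1, t2):
    """T1 subsumes T2 but not the other way round."""
    return theory_subsumes(t1, t2) and not theory_subsumes(t2, t1)

def subsume_equivalent(t1, t2):
    """T1 and T2 subsume each other."""
    return theory_subsumes(t1, t2) and theory_subsumes(t2, t1)

# p <- a  versus its two specialisations  p <- a,b  and  p <- a,c
general = {frozenset({"p", "~a"})}
specific = {frozenset({"p", "~a", "~b"}), frozenset({"p", "~a", "~c"})}
# Adding a clause that is already subsumed yields a subsume-equivalent theory.
padded = general | {frozenset({"p", "~a", "~b"})}
```

In this encoding `padded` is subsume-equivalent to `general`, which mirrors the redundancy between theories with different numbers of clauses mentioned in the footnote on HYPER.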



Definition 2. A downward refinement operator for theories $\rho_T$ maps theories
T to sets of theories subsumed by T: $\rho_T(T) \subseteq \{T' \mid T \succeq T'\}$.

Definition 3. A refinement operator $\rho : HYP \rightarrow 2^{HYP}$ is called:

— (locally) finite iff $\rho(H)$ is finite and computable for all hypotheses H.

— proper iff for all H, $\rho(H)$ contains no $H' \sim H$.

— complete iff for all H and H', $H \succ H' \Rightarrow \exists H'' \in \rho^*(H)$ such that $H'' \sim H'$.

— weakly complete iff $\rho^*(H_{top})$ covers the entire set of hypotheses HYP
($H_{top}$ being the top hypothesis, for example the empty clause □ in the case
of clauses, or the theory {□} containing the empty clause in the case of
theories).

— solution complete (for theory refinement operators only) iff for all $H \succ H'$
such that H and H' cover all positives, $\exists H'' \in \rho^*(H)$ such that $H'' \sim H'$.

— non-redundant iff for all $H_1$, $H_2$ and H, $H \in \rho^*(H_1) \cap \rho^*(H_2) \Rightarrow H_1 \in
\rho^*(H_2)$ or $H_2 \in \rho^*(H_1)$.

— minimal iff for all H, $\rho(H)$ contains only downward covers^ and all its elements
are incomparable ($H_1, H_2 \in \rho(H) \Rightarrow H_1 \not\succeq H_2$ and $H_2 \not\succeq H_1$).

Refinement operators have a dual nature. On the one hand, they make syn- 
tactic modifications to clauses and theories. On the other, these syntactic mod- 
ifications have to agree with a semantic (generality) criterion (for a downward 
operator, the refinements have to be specialisations) . 

A refinement operator that never performs any deletions is called syntactically 
monotonic (however, such an operator may perform replacements). Syntactical 
monotonicity is important from a practical point of view since it avoids certain 
redundancies (the target of a deletion could also be reached without introducing 
the deleted element). 

Downward refinement operators for clauses operate by adding literals and 
are therefore syntactically monotonic. (Adding literals to clauses produces even 
more specific clauses.) 

However, adding clauses to theories makes these theories more general. Con- 
structing a syntactically monotonic downward refinement operator for theories 
(i.e. one that doesn’t delete clauses) is therefore not as simple as for clauses. 

Let $\rho(C)$ be a finite and complete refinement operator for clauses. $\rho(C)$ induces
the following finite and complete refinement operator for theories:

$\rho_T(T) = \{ (T \setminus \{C\}) \cup \rho(C) \mid C \in T \}$    % refinement    (2)
$\qquad\quad \cup\ \{ T \setminus \{C\} \mid C \in T \}$    % (clause) deletion

In other words, $\rho_T$ either replaces a clause $C \in T$ by (the conjunction of) all its
refinements $\rho(C)$, or deletes a clause $C \in T$. The latter alternative (clause deletion)
is necessary for completeness, although it spoils the syntactic monotonicity
of $\rho_T$, making it highly redundant and therefore impractical.
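As a rough illustration of the unrestricted operator (2), the sketch below works in a propositional setting with a single head p, so a clause is represented only by the set of its body atoms. The atom vocabulary `ATOMS` and the function names `rho` and `rho_T` are assumptions made for the example; the clause operator simply adds one body atom, which is one possible instance of a clause refinement operator rather than the one used in the paper.

```python
ATOMS = ("a", "b", "c")  # hypothetical vocabulary of body atoms

def rho(body):
    """Clause refinements: add one body atom that is not yet present."""
    return frozenset(body | {x} for x in ATOMS if x not in body)

def rho_T(theory):
    """Unrestricted theory operator: refinement steps followed by deletion steps."""
    # replace a clause C by all of its refinements rho(C)
    refinements = [(theory - {c}) | rho(c) for c in theory]
    # or delete a clause C from the theory
    deletions = [theory - {c} for c in theory]
    return refinements + deletions

T1 = frozenset({frozenset({"a"})})                      # { p <- a }
T2 = frozenset({frozenset({"b"}), frozenset({"c"})})    # { p <- b, p <- c }
```

For the one-clause theory `T1` the operator yields the two-clause theory {p ← a,b, p ← a,c} and, through deletion, the empty theory. Every clause of a theory contributes one refinement step and one deletion step, which hints at how quickly the search space grows.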

At this point, Bratko severely restricts the refinement operator to reduce
its redundancy by keeping the number of clauses constant during refinement
(see (1) above). The resulting refinement operator is however incomplete. This
leaves open the question of whether a syntactically monotonic and complete 
refinement operator can be constructed. 



3 A Syntactically Monotonic Refinement Operator 
for Theories 

In the following, we construct a syntactically monotonic, finite and solution- 
complete refinement operator for theories. (A solution-complete refinement op- 
erator may not generate all possible theories, but it will guarantee the generation 



^ H' is a downward cover of H iff $H \succ H'$ and no $H'' \in HYP$ satisfies $H \succ H'' \succ H'$.
of all solutions, i.e. theories covering all positive examples and no negative
examples, while locally maximizing a certain covering heuristic.)

To start with, note that $\rho_T(T)$ replaces a clause C by all its clause refinements
$\rho(C)$. This produces very long theories in just a few refinement steps. These could
be simplified by clause deletion, but this still raises the following question: are
all clauses from $\rho(C)$ really necessary in the refinement of T?

The answer is 'no', especially when strict subsets $\rho'$ of $\rho(C)$ are capable of
covering (jointly with $T \setminus \{C\}$) all positives. Indeed, even if $(T \setminus \{C\}) \cup \rho(C)$ is
in principle more general than $(T \setminus \{C\}) \cup \rho'$, the introduction of the redundant
clauses $\rho(C) \setminus \rho'$ in the refinement of T seems unjustified, especially since it only
increases theory size without improving its coverage (since $(T \setminus \{C\}) \cup \rho(C)$ is
more general, it could in fact cover more negative examples! On the other hand,
both theories cover all positives^).

For obtaining minimal theories using a downward operator, we should therefore
only add the smallest subsets $\rho'$ of $\rho(C)$ that preserve the covering of all
positives:

$\rho'_T(T) = \{\, T' = (T \setminus \{C\}) \cup \rho' \mid C \in T \text{ and } \rho' \subseteq \rho(C) \text{ minimal (w.r.t. set inclusion) such that } T' \text{ covers all positives} \,\}$    (3)

Considering a minimal^ $\rho' \subseteq \rho(C)$ instead of the full $\rho(C)$ ensures that clauses
which do not interact in (or jointly contribute to) covering the positives are not 
kept together, thereby minimizing the theory size (which is obviously important 
in learning). 

Normally, if the clauses $C_i$ of $\rho(C) = \{C_1, \ldots, C_n\}$ do not “interact”, we
re-obtain HYPER’s refinement operator

$\rho_H(T) = \{ (T \setminus \{C\}) \cup \{C_i\} \mid C \in T,\ C_i \in \rho(C) \}$.

However, in general, clauses $C_i \in \rho(C)$ do interact. These are the cases in
which our new refinement operator increases the number of clauses in the theory. 
This is done only when the introduction of new clauses is necessary for preserving 
the coverage of all positives. 

For example, when refining a theory T (by refining one of its clauses $C \in T$),
the number of clauses will increase only if some $C_i \in \rho(C)$ is not capable of
covering all positives by itself^, so that at least some other $C_j \in \rho(C)$ is needed
as well (see Figure 1): $T \mapsto (T \setminus \{C\}) \cup \{C_i, C_j\}$. Obviously, replacing C by
the more specific $C_i \wedge C_j$ may avoid covering some negative examples.

The unrestricted refinement operator $\rho_T$ makes minimal refinement steps
(whenever it doesn’t perform deletions) since $(T \setminus \{C\}) \cup \rho(C)$ is more general
than $(T \setminus \{C\}) \cup \rho'$ for every $\rho' \subseteq \rho(C)$.

^ When refining theories using a downward operator, we can safely discard any theory
not covering all positives, since downward refinements are specialisations and
therefore will not be able to extend their coverage.

^ There can be several such minimal $\rho' \subseteq \rho(C)$.

^ together with $T \setminus \{C\}$.
Fig. 1. C is refined to $C_i \wedge C_j$. Refining C separately to $C_i$ or $C_j$ would spoil the coverage
of all positives, making the introduction of a new clause in the theory necessary.



However, the minimality of the refinement steps of $\rho_T$ also involves a significant
increase in theory size^, which is usually not justified by the examples^.

To make this observation more precise, we introduce a heuristic function for 
evaluating the merit of a hypothesis: 

$f(T) = pos(T) - neg(T) - |T|$

where |T| is the size of theory T, while pos(T) (respectively neg(T)) is the number
of positive (negative) examples covered by T. (f is to be maximized. Since
all theories constructed by $\rho'_T$ cover all positives, pos(T) = pos is a constant.)

We also introduce the notion of “compression” realized by a theory T as 

$k(T) = pos(T) - |T|$.

A solution T covers no negative examples (neg(T) = 0) and its size should
be smaller than the number of positive examples covered: |T| < pos(T), i.e.
k(T) > 0. (Note that the compression k(T) is an upper bound on the merit
function, $f(T) \leq k(T)$, with equality only in the case of solutions.)

Very frequently, the unrestricted $\rho_T$ makes only very small^ refinement steps
and thus only increases the size |T| without modifying the coverage pos(T) −
neg(T). This size increase would be justified only if the coverage were improved.
This is exactly what our improved $\rho'_T$ does: it tolerates a size increase
only if all the newly introduced clauses are necessary for covering all positives.

More precisely, let $\rho'$ be a minimal subset of $\rho(C)$ such that $(T \setminus \{C\}) \cup \rho'$ still
covers all positives. Then adding any additional (redundant) clause $C'' \in \rho(C) \setminus \rho'$
to a refinement $T' = (T \setminus \{C\}) \cup \rho'$ of T, i.e. considering $T'' = (T \setminus \{C\}) \cup \rho' \cup \{C''\}$,
will not only increase the size of the resulting theory, $|T''| > |T'|$, but will also
possibly increase the number of negative examples covered, $neg(T'') \geq neg(T')$
(since T'' is more general than T'), thus leading to a theory that is worse (w.r.t.
the heuristic function) than the original refinement: $f(T'') < f(T')$.

^ which makes the resulting theories impractical in just a few refinement steps.

^ There may be no difference in example coverage between $(T \setminus \{C\}) \cup \rho(C)$ and
$(T \setminus \{C\}) \cup \rho'$. In other words, while the first theory is intensionally more general
than the second, the two theories can be extensionally equivalent.

^ small w.r.t. the generality order.
3.1 Implementing the Syntactically Monotonic Refinement
Operator for Theories 

The pseudo-code below, implementing $\rho'_T$ (3), avoids generating all subsets of
$\rho(C)$ - it never generates supersets of the minimal subsets $\rho'$. This is realized
by generating the subsets in increasing order^ and by blocking the generation of
supersets of the minimal subsets covering all positives.

Subsets S' covering all positives are never added to the candidate list L, so
we will never generate supersets of S' from S' itself. However, since supersets
of S' could also be generated from sets that differ from S', we use nogoods for
avoiding the generation of such supersets^.

Compute ρ'_T(T)

ρ'_T(T) := ∅
forall C ∈ T
  L := [∅]; assume ρ(C) = {C_1, ..., C_n}    % L initially holds the empty index set
  while L is non-empty
    extract the first S = {i_1, ..., i_k} from L
    for j = i_k + 1, ..., n    % j starts at 1 when S is empty
      S' = S ∪ {j}
      if ∄ nogood(S'') such that S'' ⊆ S'    (*)
        if T' = (T \ {C}) ∪ {C_i | i ∈ S'} covers all positives
          add T' to ρ'_T(T)    % return T'
          add nogood(S')
        else append S' to L
        end if
      end if
    end for
  end while
end forall

Instead of adding T' to $\rho'_T(T)$, one could return T' (as a refinement of T)
immediately and rely on backtracking to obtain alternative refinements.

For $\rho(C) = \{C_1, \ldots, C_n\}$, the above algorithm computes the subsets $\rho' \subseteq
\rho(C)$ that are minimal w.r.t. set inclusion such that $T' = (T \setminus \{C\}) \cup \rho'$ covers
all positives.
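The enumeration of minimal covering subsets can be sketched in Python as follows. This is an illustrative reading of the pseudo-code rather than the author's implementation: index sets are kept as sorted tuples, the candidate list is a FIFO queue so that subsets are visited in order of increasing size, and the nogood test is a plain linear scan instead of the tree-like representation mentioned in the footnote. The callback `covers_all_positives` is a hypothetical stand-in for testing whether T' covers all positives.

```python
def minimal_covering_subsets(n, covers_all_positives):
    """Return the index sets S within {1..n} that are minimal (w.r.t. set inclusion)
    among those for which covers_all_positives(S) holds."""
    nogoods = []      # minimal covering subsets found so far
    frontier = [()]   # candidate list L, starting from the empty index set
    while frontier:
        s = frontier.pop(0)                # extract the first candidate
        start = s[-1] + 1 if s else 1      # only extend with larger indices
        for j in range(start, n + 1):
            s_new = s + (j,)
            # skip supersets of subsets already known to cover all positives
            if any(set(g) <= set(s_new) for g in nogoods):
                continue
            if covers_all_positives(frozenset(s_new)):
                nogoods.append(s_new)      # minimal: record it, do not extend it
            else:
                frontier.append(s_new)     # not yet covering: extend later
    return nogoods

# Hypothetical coverage test: {C_1} alone, or {C_2, C_3} together, cover all positives.
example = minimal_covering_subsets(3, lambda s: 1 in s or {2, 3} <= s)
```

With this coverage test the result consists of the index sets {1} and {2, 3}; the non-minimal {1, 2} is blocked by the nogood recorded for {1}.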

For example, for $\rho(C) = \{C_1, C_2, C_3\}$, if $\rho'_1 = \{C_1\}$ and $\rho'_{23} = \{C_2, C_3\}$ cover
all positives^, but $\rho'_2 = \{C_2\}$ and $\rho'_3 = \{C_3\}$ do not, we will consider only
the two theory refinements corresponding to $\rho'_1$ and $\rho'_{23}$. Note that we will not
consider the refinement $\rho'_{12} = \{C_1, C_2\}$ since it is not minimal w.r.t. set inclusion
($\{C_1, C_2\} \supset \{C_1\}$, which covers all positives). See Figure 2.



^ i.e. we first generate the refinements for which $\rho'$ is a singleton, then those where $\rho'$
contains 2, then 3, 4, ... clauses.

^ The nogood test (*) can be efficiently implemented (using a tree-like representation
for nogood sets).

^ together with $T \setminus \{C\}$.



A Refinement Operator for Theories 






Fig. 2. Minimal subsets covering all positives.

Fig. 3. $C_1$ is refined to $C_1'$ for avoiding the negatives −. $C_2$ is introduced
to cover the remaining positives $+_2$, but only when it is needed (i.e. as a
refinement of $C_1$, and not before refining $C_1$).



Intuitively, considering $\{C_1, C_2\}$ (i.e. $C_1 \wedge C_2$) as a refinement would amount
to considering a 2-clause theory containing $C_1$ even if $C_1$ covers by itself all
positives. Now, it may be that a second clause $C_2'$ will be needed later to preserve
the coverage of all positives (after having refined $C_1$ to a more specific $C_1'$ for
avoiding negatives). However, this $C_2'$ need not be introduced now - it will be
when needed (e.g. when refining $C_1$ to $C_1' \wedge C_2'$ - see Figure 3).

Testing all subsets $\rho' \subseteq \rho(C)$ for minimality may pose efficiency problems due
to the large number of such subsets. However, due to the syntactic monotonicity
of $\rho'_T$, the size of theories increases during refinement, while their compression
decreases monotonically. This imposes an upper bound on the size of subsets
$\rho'$ that should be considered. More precisely, when replacing clause C by some
subset $\rho' \subseteq \rho(C)$ with n clauses, $|T'| = |T| - |C| + n(|C| + 1)$, since the clauses $C' \in
\rho'$ are obtained from C by adding a literal. For obtaining a positive compression:
$0 < k(T') = pos - |T'| = k(T) + |C| - n(|C| + 1)$,

$n = \frac{k(T) - k(T') + |C|}{|C| + 1} < \frac{k(T) + |C|}{|C| + 1}$    (4)

The upper bound (4) on the size of subsets $\rho'$ we have to consider is not very
useful in the case of high compression rates. However, we can use it in a more
sophisticated implementation in which the subsets $\rho'$ are subject to lazy evaluation
(instead of being generated all at once).

More precisely, we can first construct only the refinements $\rho'$ that guarantee
a given (high) compression rate K ($k(T') \geq K$) and then gradually decrease
K until a solution is found.
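The size bound can be turned into a small helper, sketched below under the assumption that compression values and clause sizes are integers. It follows from $k(T') = k(T) + |C| - n(|C| + 1)$: requiring $k(T') \geq K$ gives $n \leq (k(T) + |C| - K) / (|C| + 1)$, and K = 1 corresponds to positive compression. The function name, its parameters and the sample schedule of K values are illustrative assumptions.

```python
def max_subset_size(k_T, clause_size, K=1):
    """Largest n such that k(T') = k_T + |C| - n * (|C| + 1) is at least K.

    K = 1 corresponds to requiring a positive (integer) compression.
    """
    return max(0, (k_T + clause_size - K) // (clause_size + 1))

# Lazy evaluation sketch: start with a demanding compression target K and relax it.
# Here k(T) = 10 and |C| = 2 are made-up values for illustration.
schedule = [(K, max_subset_size(10, 2, K)) for K in (9, 6, 3, 1)]
```

As K is lowered, larger subsets become admissible, so only small subsets need to be generated in the early, high-compression passes.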

Example 1. For simplicity, we consider a propositional example, where we can 
simply represent positive and negative examples for some predicate p as: 

+a,b    +a,c    −a    −c
^ using (4) to impose an upper bound on the size n of the subsets $\rho'$.
(+a,b denotes a positive example $e_1$, which would be represented in the usual
ILP notation as p($e_1$). a($e_1$). b($e_1$).

+a,c represents p($e_2$). a($e_2$). c($e_2$).

while −a denotes the negative example ¬p($e_3$). a($e_3$).)

Figure 4 depicts the associated refinement tree for the starting theory $T_0$ =
{p ←}. The refinements of clause p ← are $\rho$(p ←) = {p ← a, p ← b, p ← c}
and the minimal subsets covering all positives make up the theory refinements:
$T_1$ = {p ← a} and $T_2$ = {p ← b, p ← c}.

Then, when refining $T_1$ with $\rho$(p ← a) = {p ← a,b, p ← a,c}, both refinements
are needed for covering all positives thus producing theory $T_3$ (which is a
solution).

On the other hand, refining $T_2$ produces $T_4$ and $T_5$, only $T_4$ being a solution.



[Figure: refinement tree for the starting theory $T_0$ = {p ←}]




The search algorithm presented below uses a list of hypotheses (Theories), 
initialized with a starting theory (for example the one containing the empty 
clause) . 

solution search

Theories := [{□}]
while Theories is non-empty
  extract T from Theories (according to heuristic f)
  if T is a solution then return T
  add ρ'_T(T) to Theories
end while
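The search loop can be tried out on the propositional data of Example 1 with the sketch below. Several simplifications are assumptions of this illustration: clauses are represented by their body atoms only (the head p is implicit), the clause operator adds a single body atom, minimal covering subsets are found by brute force with `itertools.combinations` instead of the nogood scheme, and a theory counts as a solution as soon as it covers all positives and no negatives (the compression condition k(T) > 0 is not enforced on this tiny data set). The names `ATOMS`, `POS`, `NEG`, `refine` and `search` are hypothetical.

```python
from itertools import combinations

ATOMS = ("a", "b", "c")
POS = [frozenset("ab"), frozenset("ac")]   # +a,b   +a,c
NEG = [frozenset("a"), frozenset("c")]     # -a     -c

def covers(theory, example):
    """A clause p <- body covers an example iff all its body atoms hold in it."""
    return any(body <= example for body in theory)

def covers_all_pos(theory):
    return all(covers(theory, e) for e in POS)

def f(theory):
    """Heuristic f(T) = pos(T) - neg(T) - |T|, with |T| counting all literals."""
    pos = sum(covers(theory, e) for e in POS)
    neg = sum(covers(theory, e) for e in NEG)
    size = sum(len(body) + 1 for body in theory)
    return pos - neg - size

def rho(body):
    """Clause refinements: add one body atom that is not yet present."""
    return [body | {x} for x in ATOMS if x not in body]

def refine(theory):
    """Replace one clause by a minimal subset of its refinements that keeps all positives covered."""
    out = set()
    for c in theory:
        rest, minimal, cands = theory - {c}, [], rho(c)
        for k in range(1, len(cands) + 1):          # increasing subset size
            for sub in map(frozenset, combinations(cands, k)):
                if any(m <= sub for m in minimal):  # superset of a minimal subset
                    continue
                if covers_all_pos(rest | sub):
                    minimal.append(sub)
                    out.add(rest | sub)
    return out

def search(start):
    theories = [start]
    while theories:
        t = max(theories, key=f)                    # extract according to heuristic f
        theories.remove(t)
        if covers_all_pos(t) and not any(covers(t, e) for e in NEG):
            return t
        theories.extend(refine(t))
    return None

T0 = frozenset({frozenset()})                       # { p <- }
solution = search(T0)
```

On this data the first refinement step yields {p ← a} and {p ← b, p ← c}, and the search returns {p ← b, p ← a,c}, the theory called $T_4$ in Example 1. It is preferred over $T_3$ because it is one literal smaller.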

The refinement operator $\rho(C)$ for clauses used by $\rho'_T$ works by adding to
clause C either

— a positive literal $p(X_1, \ldots, X_n)$ (with new and distinct variables) involving
a target predicate p (in the case of Horn clauses, this is allowed only if C
contains no other positive literal), or

— a negative literal $p(X_1, \ldots, X_n)$ (with new and distinct variables) involving
either a target predicate or a predicate from the background theory, or

— a negative equality literal $X_i = X_j$ involving variables $X_i$ and $X_j$ from C
(for properness, $X_i = X_j$ should not be deducible from the equality literals
of C).
4 Reducing Redundancies

Although $\rho'_T$ eliminates certain redundancies of $\rho_T$, other redundancies still
remain. These are mainly due to the commutativity of the refinement operations
(such as adding a literal to a clause), since all the permutations of a set of
operations will now produce the same hypothesis. As shown in [1,2], eliminating
such redundancies amounts to destroying the commutativity of the refinement
operations. This is equivalent to imposing a traversal discipline in the space of
hypotheses and can be done by using order relations on literals, variables and
clauses. As already hinted in the Introduction, a flexible refinement operator
requires dynamic order relations (constructed at search time, rather than
predetermined). [2] deals with such flexible refinement operators for clauses. In
the following, we show how to construct flexible refinement operators for theories
by extending the technique from [2]. Note that a straightforward extension
of the technique from [2] to theories would introduce order relations not only
on variables and literals, but also on clauses. However, this would only allow
constructing theories clause by clause, just like in implemented systems using
refinement operators for clauses (like Progol or FOIL).

Example 2. Consider the 2-clause theory $T = \{C_1, C_2\}$, initially with $C_1 = a$,
$C_2 = d$, which we want to refine first by adding literal b to $C_1$, then e to $C_2$ and
finally c to $C_1$.

If we have separate (dynamic) literal and clause orderings, then adding b to
$C_1$ induces the literal ordering a < b, then adding e to $C_2$ induces not only d < e,
but also the clause ordering $C_1 < C_2$. The latter ordering will now disallow a
further refinement of $C_1$, such as adding c to $C_1$. (We could have obtained the
desired refinement only if all refinements of $C_1$, i.e. adding b and c, had
preceded the refinements of $C_2$. This reduces the flexibility of the refinement
operator for theories and in fact we re-obtain the usual clause by clause covering
approach.)

To increase the flexibility of the theory refinement operator, we shall replace 
the two separate literal and clause orderings by a single order relation on the 
literals from all clauses. Thus, instead of ordering the literals within clauses and 
subsequently the clauses in their entirety, we introduce a finer grained ordering 
between the literals of all clauses. 

More precisely, the ordering will involve literal occurrences of the form L.id(C)
(representing literal L from clause C). Distinguishing the occurrences of the same
literal in different clauses increases the flexibility of the resulting refinement
operator. For example, refining $T = \{C_1, C_2\}$ with $C_1 = a$, $C_2 = b$ by adding b
to $C_1$ and a to $C_2$ wouldn't be allowed if we hadn't made the above-mentioned
distinction (since an inconsistent ordering a < b, b < a would result). With
literal occurrences, we obtain the consistent ordering a.1 < b.1, b.2 < b.1, b.1 < a.2
(where $id(C_1) = 1$, $id(C_2) = 2$).

However, this simple approach using literal occurrences L.id(C) only works
whenever clauses are not "split" and therefore have a well-defined identity id(C),
one that does not change under refinement (as for example in HYPER). We will
therefore show in the following how the redundancy of HYPER's refinement
operator can be reduced by adding an ordering '<' between literal
occurrences L.id(C) and one '$\prec_{id(C)}$' between variable occurrences $X_i$. (See [2] for a
justification of the treatment of the variable ordering.)

The following refinement operator $\rho_H$ eliminates not only the redundancies
brought about by the commutativity of the refinements of a given clause (as
in [2]), but also the redundancies arising from the commutativity of refinement
operations on different clauses (of the theory being refined).

$T' \in \rho_H(T)$ iff $T' = (T \setminus \{C\}) \cup \{C'\}$ for $C \in T$, $C' \in \rho(C, T)$, where

$C' \in \rho(C, T)$ iff either

(1) $C' = C \cup \{L^{(i)}\}$ for some background literal $L$ that occurs $i - 1$ times
in $C$ (i.e. $L^{(1)}, \ldots, L^{(i-1)} \in C$) and such that adding the global constraints
$L^{(i)}.id(C) > T$ preserves the consistency of the global constraint store,
$vars(C') = vars(C) \cup vars(L^{(i)})$
(where $vars(L^{(i)})$ are the variables of $L^{(i)} = p(X^{(i)})$), or

(2) $C' = C \cup \{X_i = X_j\}$ with $X_i, X_j \in vars(C)$, such that adding the global
constraints

(a) $(X_i = X_j).id(C) > T$ and

(b1) $X_i \prec_{id(C)} X_j$, $X_k \prec_{id(C)} X_j$ for each $X_k$ such that $(X_i = X_k) \in C$ or
$(X_k = X_i) \in C$, or

(b2) $X_j \prec_{id(C)} X_i$, $X_k \prec_{id(C)} X_i$ for each $X_k$ such that $(X_j = X_k) \in C$ or
$(X_k = X_j) \in C$

preserves the consistency of the global constraint store.
$vars(C') = vars(C) \setminus \{X_i\}$ if (b2) was applied,
else $vars(C') = vars(C) \setminus \{X_j\}$.

In both cases, (1) and (2), $id(C') = id(C)$.

$\rho_H$ adds either a new ordinary literal, or a new equality literal. The order
relation on literals is constructed dynamically, as literals are added during
successive refinements. (Of course, the consistency of the global constraint store
needs to be preserved.)

Since the order of clauses is important in Prolog programs, we do not attempt
to eliminate the redundancies due to permutations of clauses (in the theory being
refined), such as $C_1 \wedge C_2 = C_2 \wedge C_1$. Eliminating such redundancies would involve
technical complications that are outside the scope of this paper.

For a literal L = p(Y) with variable tuple Y, we introduce a standardization for the
variables $X^{(i)}$ of the i-th occurrence $L^{(i)} = p(X^{(i)})$ of the literal L in some clause
(the new and distinct variables $X^{(i)}$ are the same for the i-th occurrence of L on all
alternative paths).

Adding $L'.id(C') > T$ to the global constraint store ($L'$ being a literal, $id(C')$ a
clause identifier and T a theory) amounts to adding $L'.id(C') > L.id(C)$ for all
$C \in T$ and all $L \in C$ (or, more practically, for all maximal $L \in C \in T$).

We add $X_i \prec_{id(C)} X_j$ only if such an $X_i = X_k$ or $X_k = X_i$ exists in C.

Special care has to be taken for allowing multiple occurrences of a given
background literal L in a clause C. For ensuring the compatibility of the induced
(literal and variable) orderings on the various alternative search paths, we have
to use the same variable names $X^{(i)}$ for the i-th occurrence $L^{(i)} = p(X^{(i)})$ of
literal L on all paths.

Adding equalities is trickier to a certain extent due to the transitivity of
equality. First, we have to avoid the trivial redundancies that would appear if
we allowed adding $X_i = X_j$ for $X_i$ and $X_j$ already belonging to the same cluster
of variables. (A cluster is a set of variables already unified with each other.) We
do this by keeping in the set vars(C) of variables that are candidates for
unification just one representative of each variable cluster.

The constraints introduced at step (2b) ensure that a variable cluster
$X_1 = X_2 = \ldots = X_n$ can be generated with only one sequence of refinements of type
(2), for example $X_1 = X_2$, followed by successively adding $X_3, X_4, \ldots, X_n$ to
the growing cluster.
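The bookkeeping of one representative per variable cluster can be illustrated with a small union-find structure. This is a hedged sketch only: the class `Clusters` and its methods are hypothetical names, and the sketch covers just the "same cluster" test, not the variable ordering constraints of step (2b).

```python
class Clusters:
    """Variable clusters with one representative each (illustrative sketch)."""

    def __init__(self, variables):
        # Initially every variable is the representative of its own cluster.
        self.parent = {v: v for v in variables}

    def find(self, v):
        """Return the representative of the cluster containing v."""
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def unify(self, keep, drop):
        """Merge the cluster of `drop` into the cluster of `keep`.

        Returns False when both variables already belong to the same cluster,
        i.e. when adding the equality would be trivially redundant.
        """
        rk, rd = self.find(keep), self.find(drop)
        if rk == rd:
            return False
        self.parent[rd] = rk
        return True

    def representatives(self):
        """The candidates for further unification: one variable per cluster."""
        return sorted({self.find(v) for v in self.parent})


clusters = Clusters(["X1", "X2", "X3"])
first = clusters.unify("X1", "X2")    # X1 = X2 is a genuine refinement
second = clusters.unify("X2", "X1")   # same cluster already: redundant
```

After the first unification only `X1` and `X3` remain as candidates, so an equality between two members of the same cluster can no longer be proposed.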

Example 3. For the theory $T = \{C_1, C_2\}$, the following sequence of refinements:

add literal a to $C_1$, add b to $C_1$, add c to $C_2$, add d to $C_1$, add e to $C_2$,
add f to $C_2$

produces the literal ordering: a.1 < b.1 < c.2 < d.1 < e.2 < f.2, which will
disallow the re-generation of the same theory by a permutation of the above
operations.
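One way to picture the global constraint store is as a directed graph of strict order constraints between literal occurrences, where a new constraint is rejected if it would close a cycle. The sketch below assumes exactly that representation; `OrderStore` and `add_literal` are hypothetical names, and literal occurrences are written as strings such as "a.1".

```python
class OrderStore:
    """Global store of strict 'smaller < larger' constraints (illustrative sketch)."""

    def __init__(self):
        self.succ = {}  # occurrence -> set of occurrences known to be larger

    def reaches(self, src, dst):
        """True if src < ... < dst (or src == dst) follows from the store."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.succ.get(node, ()))
        return False

    def add(self, smaller, larger):
        """Record smaller < larger; refuse if this contradicts the store."""
        if self.reaches(larger, smaller):
            return False
        self.succ.setdefault(smaller, set()).add(larger)
        return True


def add_literal(store, theory, occurrence):
    """Add `occurrence` to `theory` only if 'occurrence > every occurrence
    already in theory' is consistent with the global store."""
    if any(store.reaches(occurrence, old) for old in theory):
        return False  # would contradict an ordering fixed on another path
    for old in theory:
        store.add(old, occurrence)
    theory.append(occurrence)
    return True


store = OrderStore()
path_one = []
accepted = [add_literal(store, path_one, occ) for occ in ["a.1", "b.1", "c.2"]]

# A permutation of the same operations on an alternative search path:
path_two = []
b_first = add_literal(store, path_two, "b.1")
a_second = add_literal(store, path_two, "a.1")  # needs b.1 < a.1: rejected
```

The second path is cut off as soon as it requires b.1 < a.1, because a.1 < b.1 is already in the store, so the permuted sequence cannot regenerate the same theory.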

Reducing the redundancies of our more general $\rho'_T$ operator is even more
complicated, mainly because of the difficulty in assigning identities to clauses
obtained by "splitting". Due to space limitations, it will be the subject of a
separate paper.

5 Conclusions

The refinement operator for theories $\rho'_T$ presented in this paper represents a first
step towards constructing more efficient and flexible ILP systems with precise
theoretical guarantees.

Its main properties are syntactical monotonicity, solution completeness and
flexibility. Flexibility allows interleaving the refinements of clauses, and thus
exploiting a good search heuristic by avoiding the pitfalls of a greedy covering
algorithm. On the other hand, syntactical monotonicity is important for
eliminating certain annoying redundancies due to clause deletions.

We also show how to eliminate (for HYPER's refinement operator) the
redundancies due to the commutativity of refinement operations while preserving
a limited form of flexibility.

The paper also deals with theory refinement, but the main focus is on
other aspects, such as constructing bottom theories. Unfortunately, the paper
has several problems, which, for lack of space, cannot be discussed here.

Abstract. First-order theory refinement using neural networks is still an open
problem. Towards a solution to this problem, we use inductive logic programming
techniques to introduce FOCA, a First-Order extension of the Cascade ARTMAP
system. To present such a first-order extension of Cascade ARTMAP, we: a) modify
the network structure to handle first-order objects; b) define first-order versions
of the main functions that guide all Cascade ARTMAP dynamics, the choice and
match functions; c) define a first-order version of the propositional learning
algorithm to approximate Plotkin's least general generalization. Preliminary results
indicate that our initial goal of learning logic programs using neural networks can
be achieved.



1 Introduction 

The Cascade ARTMAP system is a knowledge-based neural network (KBNN),
like KBANN, RAPTURE, and C-IL2P, that has been shown to outperform
other purely analytical or inductive systems in the task of propositional theory refinement:
prior incomplete and/or partially correct propositional symbolic knowledge about a
problem domain is given to a theory refinement system and is revised by training the
system with examples.

Three main advantages of Cascade ARTMAP over other KBNN systems are: a) the
initial rule structure of the network created by the insertion algorithm is preserved during
training (this is a major problem when using backpropagation training for the task of
theory refinement), which facilitates rule extraction by allowing a direct comparison of
the extracted rules to the originally inserted rules; b) it is an incremental learning system;
c) it combines instance-based learning with rule induction*.

All these characteristics are inherited from the fuzzy ARTMAP system, of which
Cascade ARTMAP is an extension that allows the representation of intermediate
attributes of rule-based knowledge and multi-step inference (rule chaining).

Inductive Logic Programming (ILP) [10] augments the expressive power of inductive
learning and theory refinement tasks to (first-order) logic programming [23]. This has
allowed ILP systems to handle problems like mutagenicity, carcinogenicity, drug design,
language learning, and benchmark problems like the east-west train problem [12].

* This characteristic is shared by the Rule Induction from a Set of Exemplars (RISE) system.

C. Rouveirol and M. Sebag (Eds.): ILP 2001, LNAI 2157, pp. 15-26, 2001.

Most ILP systems, such as FOIL, GOLEM, TIM and PROGOL, use
a covering algorithm to generate hypotheses. However, as pointed out by Bratko (whose
HYPER is a recent exception [3]), systems that use a covering algorithm have difficulties
in learning multiple predicates and recursive definitions, and will have unnecessarily long
clauses for hypotheses. Cascade ARTMAP, by contrast, uses a non-covering algorithm
to generate each hypothesis as a whole, like other KBNN systems.

Our main goal is to combine ILP techniques with Cascade ARTMAP, defining a
unified hybrid first-order KBNN system, named FOCA, which takes advantage of its
parent methodologies while attempting to overcome their limitations. In order to present
such a first-order extension of Cascade ARTMAP, we: a) modify the network structure to
handle first-order objects; b) define first-order versions of the choice (similarity) function
and of the vigilance criterion (functions that guide all Cascade ARTMAP dynamics);
c) define a first-order version of the learning algorithm (the weight update), which
approximates Plotkin's least general generalization (lgg).

Related work includes the following. The ILP systems GOLEM and TIM are also
bottom-up systems that use and approximate lgg, but not combined with incremental
clustering. The work in [1] uses the ILP system LINUS [10], which transforms a first-order
learning task to attribute-value form, and applies a neural network as its attribute-value
learning algorithm. This can only be done for a restricted class of problems. Another
work presents a first-order KBNN based on radial-basis networks. Its learning algorithm
only deals with the numeric part of the theory, which has no rule chaining and is
recursion-free. A further work shares the same objectives as this work, but its learning part is
still being developed and uses a backpropagation-based KBNN.

The remaining sections of this paper are organized as follows. Section 2 presents
an introduction to Cascade ARTMAP, Section 3 describes the FOCA system, Section
4 presents preliminary experimental results, and Section 5 contains conclusions and
directions for future work.



2 Cascade ARTMAP 

We only review Fuzzy ARTMAP [4], since Cascade ARTMAP is an extension of it
that allows rule insertion and rule cascading (chaining), and these are done similarly in
FOCA.

Fuzzy ARTMAP (see Fig. 1) incorporates two Fuzzy ART modules, ART$_a$ and
ART$_b$, which are linked together via an inter-ART map field $F^{ab}$. Each ART module
is an unsupervised clustering system. We can view each example presented to Fuzzy
ARTMAP as a rule, the input part being the body (antecedent) and the output part
(class) being the head (consequent) of the rule. ART$_a$ receives the examples' bodies and
constructs a clustering scheme, aggregating similar bodies in the same cluster (a node in
the $F_2^a$ layer). Each cluster has a prototype (a weight vector $w_j^a$ connecting cluster j to
all nodes in $F_1^a$) that is the generalization of the bodies that belong to that cluster. ART$_b$
receives examples' heads and similarly constructs a clustering scheme in $F_2^b$. The map
field links each category formed in each ART module consistently.

Fig. 1. The Fuzzy ARTMAP system [4]

Each Fuzzy ART module has three layers:



1. $F_0$, with the same number of nodes as the module's input vector (unless we use a
normalization scheme like complement coding, as in Fig. 1). Vector I denotes the
activation of $F_0$ (A in ART$_a$ and B in ART$_b$);

2. $F_1$, with the same number of nodes as $F_0$. The nodes in these two layers are linked
together by one-to-one weights. Vector x denotes the activation of $F_1$;

3. $F_2$, which is fully connected with $F_1$. The most important weights in the ART
network are the weights between one node j in $F_2$ and all nodes in $F_1$, denoted by
$w_j$, because they contain information about the prototypes of the network. Vector y
denotes the activation of $F_2$.

The dynamics of Fuzzy ART depends on a choice parameter $\alpha > 0$, a learning rate
parameter $\beta \in [0,1]$, and a vigilance parameter $\rho \in [0,1]$. For Fuzzy ARTMAP, there
are two additional parameters: the minimum value of the vigilance parameter of ART$_a$,
called the baseline parameter $\bar{\rho}_a$, and the vigilance parameter of the map field, $\rho_{ab}$.

We can view the Fuzzy ARTMAP dynamics as a propositional inductive learning
system: for each example presentation (see step 1 of the algorithm below), the most
similar rule already encoded in the Fuzzy ARTMAP (steps 2 and 3) satisfying the
vigilance criterion is chosen to generalize with this example (step 4a); if there is no
rule satisfying the criterion, a new one is created from the example.

The following is the Fuzzy ARTMAP learning algorithm for a set E of examples. 

Algorithm 

For each example $e = (a, b) \in E$, do:

step 1 (Input presentation): Let $\rho_a = \bar{\rho}_a$, $A = a$, $B = b$ (here a normalization scheme
is applied as in Fig. 1), $x^a = A$ and $x^b = B$.

step 2 (Category selection): 2.1) For each ART module, calculate the choice function
$T_j(I)$ (the degree to which $w_j$ is a subset of the input I) for each node j in $F_2$, and select
the node J that has the greatest value. The choice function $T_j(I)$ is defined by:

$$T_j(I) = \frac{|I \wedge w_j|}{\alpha + |w_j|}$$

where input I is A in ART$_a$ and B in ART$_b$, $\wedge$ is the fuzzy intersection defined by
$(p \wedge q)_i = \min(p_i, q_i)$, and the norm $|\cdot|$ is defined by

$$|p| = \sum_i |p_i|$$

2.2) For each node pre-selected in 2.1 (J in $F_2^a$ and K in $F_2^b$), calculate the match function
$m_J(I)$ (the degree to which the input is a subset of the prototype $w_J$), defined as

$$m_J(I) = \frac{|I \wedge w_J|}{|I|}$$

If the match function for node J and/or K is greater than or equal to $\rho$ (vigilance criterion),
we say that resonance happens and then the respective $F_2$ layer is activated; for ART$_a$: $y_J^a = 1$
and $y_j^a = 0$ for $j \neq J$ (winner-take-all activation, and similarly for ART$_b$). Otherwise,
shut down the currently chosen node (the node cannot be selected during the presentation
of the current example), and go to 2.1 to select a new node in the respective module. If no
category can be chosen, let J and/or K be the new nodes created dynamically, with $w_J^a = 1$,
$w_J^{ab} = 1$, $y_J^a = 1$, and $y_j^a = 0$ for $j \neq J$, and/or $w_K^b = 1$, $y_K^b = 1$, $y_k^b = 0$ for $k \neq K$.

step 3 (Verification in the map field): In the map field, $x^{ab} = w_J^{ab} \wedge y^b$. If $|x^{ab}| / |y^b| \geq \rho_{ab}$
then go to step 4a, otherwise go to step 4b.

step 4a (Learning): Let

$$w_J^{a(new)} = \beta_a (I \wedge w_J^{a(old)}) + (1 - \beta_a) w_J^{a(old)}$$

and $w_J^{ab}$ will be updated:

$$w_J^{ab(new)} = \beta_{ab} (y^b \wedge w_J^{ab(old)}) + (1 - \beta_{ab}) w_J^{ab(old)}$$

(fast learning corresponds to setting $\beta = 1$).

step 4b (Match Tracking): Let $\rho_a = m_J^a(A) + \epsilon$, $T_J = 0$, and go back to step 2.
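The choice function, match function and weight update of a single Fuzzy ART module can be sketched in plain Python as follows. This is a minimal illustration using the standard Fuzzy ART definitions; the function names (`choice`, `match`, `learn`, `select_category`, `complement_code`) are hypothetical, vectors are ordinary lists, and the map field and match tracking are left out.

```python
def fuzzy_and(p, q):
    """Fuzzy intersection: component-wise minimum."""
    return [min(a, b) for a, b in zip(p, q)]


def norm(p):
    """The norm |p|: sum of the absolute values of the components."""
    return sum(abs(a) for a in p)


def complement_code(a):
    """Complement coding, one possible normalization scheme: a -> (a, 1 - a)."""
    return list(a) + [1.0 - v for v in a]


def choice(i, w, alpha):
    """Choice function T_j(I) = |I ^ w_j| / (alpha + |w_j|)."""
    return norm(fuzzy_and(i, w)) / (alpha + norm(w))


def match(i, w):
    """Match function m_j(I) = |I ^ w_j| / |I|."""
    return norm(fuzzy_and(i, w)) / norm(i)


def learn(i, w, beta):
    """Weight update: beta * (I ^ w_old) + (1 - beta) * w_old."""
    return [beta * m + (1.0 - beta) * old for m, old in zip(fuzzy_and(i, w), w)]


def select_category(i, weights, alpha, rho):
    """Try nodes by decreasing choice value; return the first one that passes
    the vigilance criterion, or None if a new node would have to be created."""
    order = sorted(range(len(weights)),
                   key=lambda j: choice(i, weights[j], alpha),
                   reverse=True)
    for j in order:
        if match(i, weights[j]) >= rho:
            return j
    return None


example = complement_code([0.25, 0.75])
fresh = [1.0, 1.0, 1.0, 1.0]     # an uncommitted prototype (all ones)
narrow = [0.0, 0.0, 1.0, 1.0]    # a prototype that only half-matches the example
winner = select_category(example, [fresh], alpha=0.001, rho=0.9)
updated = learn(example, fresh, beta=1.0)  # fast learning: prototype becomes I ^ w
```

With fast learning ($\beta = 1$) the prototype simply shrinks to the fuzzy intersection of the input and the old prototype, which is why a committed prototype can only become more general over time.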



3 The FOCA System 

A logic program is a set of clauses $h \leftarrow b_1, \ldots, b_n$, where h is an atom and $b_1, \ldots, b_n$
are literals (positive or negative atoms). The general learning framework from ILP
is: given an initial domain theory (background knowledge) B, a set of positive examples
$E^+$, and a set of negative examples $E^-$, the goal is to build a hypothesis H, where B,
$E^+$, $E^-$, and H are logic programs satisfying the following conditions:

- Necessity: $B \not\models E^+$ (B does not cover $E^+$);

- Sufficiency: $B \wedge H \models E^+$ (B and H cover $E^+$);

- Weak Consistency: $B \wedge H \not\models false$;

- Strong Consistency: $B \wedge H \wedge E^- \not\models false$.

In systems that handle noise, sufficiency and strong consistency are not required.

Generally in ILP, an example is a ground instance $p(c_1, \ldots, c_n)$ of a target concept
$p(V_1, \ldots, V_n)$ and its description is logically encoded in B. The hypothesis space is
generally ordered by $\theta$-subsumption. Some ILP bottom-up (specific to general) systems
search the hypothesis space using the relative least general generalization operator for
two clauses ($rlgg(e_1, e_2)$), which can be calculated in two steps: first calculate
the saturation (bottom-clause) of $e_1$ and $e_2$, $\bot_1$ and $\bot_2$, respectively, and then calculate
$lgg(\bot_1, \bot_2)$.

The FOCA system is a relational bottom-up system that works with clauses (the
bottom-clauses) as examples, clustering separately the bodies and the heads of those
clauses, and makes a consistent linking between body and head clusters. Each cluster of
bodies or heads is represented by a prototype that $\theta$-subsumes all elements in that cluster
and holds the information necessary to calculate the pertinence of a new element to the
cluster. FOCA also uses the same language bias as PROGOL, such as mode declarations,
to restrict the size of a bottom-clause.

Before presenting the FOCA algorithm in Section 3.4, we describe its architecture
in Section 3.1, the new choice and match functions in Section 3.2 and the lgg-based
operator that we use in Section 3.3.



3.1 Architecture 

The architecture of the FOCA system is shown in Fig. 2.

Fig. 2. Overview of the FOCA system

In Cascade ARTMAP, information flows through the links as numbers. In FOCA,
this information is a set of tuples of terms, each tuple having a number associated to it.
The bottom-clause's body is the input to ART$_a$ and its head is the input to ART$_b$.

Each neuron in $F_1$ represents a predicate symbol of the problem. Let predicate(k)
be the predicate symbol that neuron k represents. The activation of each neuron k in
$F_1$ is the set $A_k$, consisting of tuples of terms of predicate(k) that appear in the input
of an ART module. For instance, for the input I = {father(Z, S), father(Z, D),
mother(Q, S), mother(Q, D)} we have $A_k$ = {(Z, S), (Z, D)} and $A_j$ = {(Q, S),
(Q, D)}, where predicate(k) = father and predicate(j) = mother.

Let the union of all $A_k$ with the respective predicate symbol applied to each tuple be
$X = \{p(V_1, \ldots, V_n) \mid (V_1, \ldots, V_n) \in A_k$, where predicate(k) = p, for each $k \in F_1\}$
(X is $X^a$ in ART$_a$ and $X^b$ in ART$_b$). In Cascade ARTMAP, the vector $x^a$ represents
a working memory: the input and all attributes activated by the rule chaining. Here,
similarly, the set $X^a$ will also hold all literals inferred during the rule chaining. Steps
4b and 5a of the algorithm show how $X^a$ is updated with the literals activated in rule
chaining.

Each cluster j in the $F_2$ layer has a set $W_{jk}$ associated to each link with a neuron
k in $F_1$. $W_{jk}$ holds all tuples of terms from predicate(k) that belong to the relational
prototype, defined as $W_j = \{p(t_1, \ldots, t_n) \mid (t_1, \ldots, t_n) \in W_{jk}$, where predicate(k) =
p, for each $k \in F_1\}$. If a cluster j encodes more than one example, $W_j$ can be viewed as
an approximate rlgg of all saturations of examples encoded by j. Associated with each
literal p there is a number, named $Fuzzy(p) \in [0,1]$, that represents its fuzzy information.
Initially, if a literal does not have such a number the default value is one.



3.2 The First-Order Choice and Match Functions 



The choice and match functions were developed with the goal that
in the propositional case they be reduced to their propositional versions.

Given a category J in $F_2^a$ and the set $X^a$, the first-order equations of choice and match
functions are defined below

$$T_J(X^a) = \frac{\sum_{(w,x) \in INT_{WX}} Fuzzy(w) \wedge Fuzzy(x)}{\alpha + \sum_{w \in W_J} Fuzzy(w)} \qquad (2)$$

and

$$m_J(X^a) = \frac{\sum_{(w,x) \in INT_{WX}} Fuzzy(w) \wedge Fuzzy(x)}{\sum_{x \in X^a} Fuzzy(x)}$$

where $INT_{WX} \subseteq W_J \times X^a$ is the relational intersection, defined as

$$INT_{WX} = \{(w, x) \mid m \in mop,\ w = m\theta_w,\ x = m\theta_x\} \qquad (3)$$

with

$$mop = O \cap P$$

such that

$$\max |O \cap P|,$$

where $O\theta_w = W_J$, $P\theta_x = X^a$, $\theta_w$ and $\theta_x$ being renaming substitutions. If we have more
than one candidate for $INT_{WX}$, we choose arbitrarily. The choice and match functions
in ART$_b$ are calculated similarly.
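Once a relational intersection is available, the two functions reduce to simple sums. The sketch below is hedged accordingly: it assumes the intersection is already given as a list of (w, x) pairs, that the denominators range over all literals of the prototype $W_J$ and of the working memory $X^a$ respectively, and that literals without fuzzy information default to 1. The names `fo_choice` and `fo_match` are hypothetical.

```python
def matched_mass(int_wx, fuzzy):
    """Sum of Fuzzy(w) ^ Fuzzy(x) over the pairs of the relational intersection."""
    return sum(min(fuzzy.get(w, 1.0), fuzzy.get(x, 1.0)) for w, x in int_wx)


def fo_choice(int_wx, prototype, fuzzy, alpha):
    """First-order choice value: matched mass over (alpha + total mass of W_J)."""
    return matched_mass(int_wx, fuzzy) / (alpha + sum(fuzzy.get(w, 1.0) for w in prototype))


def fo_match(int_wx, working_memory, fuzzy):
    """First-order match value: matched mass over total mass of X^a."""
    return matched_mass(int_wx, fuzzy) / sum(fuzzy.get(x, 1.0) for x in working_memory)


# Toy data: literals are opaque strings; three of four literals are paired.
prototype = ["w0", "w1", "w2", "w3"]
working_memory = ["x0", "x1", "x2", "x3"]
int_wx = [("w0", "x0"), ("w1", "x1"), ("w2", "x2")]
fuzzy = {}  # no stored fuzzy information: every literal defaults to 1

t_value = fo_choice(int_wx, prototype, fuzzy, alpha=0.0)
m_value = fo_match(int_wx, working_memory, fuzzy)
```

Finding the largest intersection itself is the expensive part and is deliberately not attempted here; the sketch only shows how the values are read off once the pairs are known.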

3.3 Learning 

For a cluster J in ART$_a$ and the working memory $X^a$, the learning operator is defined
as

$$W_J = algg(W_J, X^a), \qquad (4)$$

where the operator algg is defined as

$$algg(W_J, X^a) = \{lgg(w, x) \mid (w, x) \in INT_{WX}\}.$$

The algg operator, unlike lgg, is dependent on the order: two different sequences of
calculations can produce two different clauses. This learning definition only supports
the fast learning setting, because we eliminate every literal in $W_J$ that is not in $INT_{WX}$.
The fuzzy information is updated by

$$Fuzzy(w) = \beta \min(Fuzzy(w), Fuzzy(x)) + (1 - \beta) Fuzzy(w), \qquad (5)$$

where $(w, x) \in INT_{WX}$. In ART$_b$, the prototype set and fuzzy information are also updated
similarly.
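The generalization step can be illustrated with Plotkin-style anti-unification applied pairwise to the literals of a given relational intersection. The following is a sketch under assumptions, not the FOCA implementation: literals and compound terms are tuples `(symbol, arg1, ...)`, constants and variables are strings, the paired literals are assumed to share predicate symbol and arity, and fresh variables are named `V0`, `V1`, ... on the assumption that these names do not already occur.

```python
def lgg_term(s, t, table):
    """Least general generalization of two terms (or same-predicate atoms).

    `table` maps each pair of differing subterms to one shared variable, so the
    same pair is always generalized to the same variable.
    """
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s) == len(t)):
        args = tuple(lgg_term(a, b, table) for a, b in zip(s[1:], t[1:]))
        return (s[0],) + args
    if (s, t) not in table:
        table[(s, t)] = f"V{len(table)}"  # hypothetical fresh-variable naming
    return table[(s, t)]


def algg(int_wx):
    """Generalize each (w, x) pair of the relational intersection,
    sharing one variable table across all pairs."""
    table = {}
    return [lgg_term(w, x, table) for w, x in int_wx]


pairs = [
    (("has_car", "X", "V"), ("has_car", "E", "F")),
    (("open", "V"), ("open", "F")),
    (("shape", "W", "rectangle"), ("shape", "H", "rectangle")),
]
generalized = algg(pairs)
```

Sharing the table across pairs is what keeps co-occurring variables linked: the car variable introduced for `has_car` is the same one that appears in `open`.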

3.4 Algorithm 

The algorithm of FOCA, shown below, is similar to the Cascade ARTMAP algorithm.
The notion of the $\psi(J)$ set is defined similarly to [21].

For each example $e \in E$, do

step 1 (Bottom-Clause generation): Calculate bottom-clause $\bot_e$ from example e. The
body of $\bot_e$ is presented to ART$_a$ and the head of $\bot_e$ to ART$_b$.

step 2 (Show $\bot_e$): Let $\rho_a = \bar{\rho}_a$. In $F_1^a$, let $X^a$ be the body of $\bot_e$.

step 3 (Cluster selection): 3.1) Calculate $T_j^a(X^a)$ (eq. 2) for each node j in $F_2^a$. 3.2) Let
J be the node chosen in step 3.1. If J passes the vigilance criterion $m_J^a(X^a) \geq \rho_a$, go to
step 4b; else shut down the chosen node and go to step 3.1 to choose a new node. If no
category can be chosen in $F_2^a$, go to step 4a.

step 4a (New node): Let J be a new node created dynamically. In ART$_b$, $X^b$ is the head
of $\bot_e$. 4a1) Calculate the choice function for each node k in $F_2^b$ and let K be the node
chosen. 4a2) If K passes the vigilance criterion $m_K^b(X^b) \geq \rho_b$ then go to step 4a3; else
shut down K and go to step 4a1 to choose a new node. If no category can be chosen in
$F_2^b$, let K be a new node in $F_2^b$. 4a3) Each node in $\psi(J)$ and node K in $F_2^b$ will be updated
as defined in Section 3.3 using eqs. 4 and 5. The weight $w_J^{ab}$ is updated using eq. 1.

step 4b (Inference): $X^{ab} = w_J^{ab}$ and the map field is activated. The one-to-one weights
between $F^{ab}$ and $F_2^b$ activate $F_2^b$. For each node k in $F_2^b$, the choice function is such that
$T_k^b = x_k^{ab}$. The winning node K is such that $T_K^b = \max\{T_k^b \mid k \in F_2^b\}$. When K is
chosen, $y_K^b = 1$ and $y_k^b = 0$ for each $k \neq K$. $X^b = W_K^b \theta$, where $\theta = \theta_w^{-1} \theta_x$, and $\theta_w$, $\theta_x$
are the renaming substitutions used to calculate $INT_{WX}$ for the cluster J in $F_2^a$ in step
3. If $X^b$ holds a target atom then go to step 5b, else go to step 5a.

step 5a (Update): Let $X^a = X^a \cup X^b$. Go to step 3.

step 5b (Matching): Let $X^b$ be the head of $\bot_e$ and calculate $m_K^b(X^b)$, imposing for the
variables in the body that appear in the head the same substitutions $\theta_w$ and $\theta_x$ used to
calculate $INT_{WX}$ for the cluster J in $F_2^a$ in step 3. If $m_K^b(X^b) \geq \rho_b$ then go to step 6a,
else go to step 6b.

step 6a (Resonance): Update each activated node in $\psi(J)$ and K in $F_2^b$ as defined in
Section 3.3, using eqs. 4 and 5.

step 6b (Mini-Match Tracking): Let $\rho_a = m^a(X^a) + \epsilon$, where $m^a(X^a) = \min\{m_k^a(X^a)
\mid k \in \psi(J)\}$, and $T_J^a = 0$. Go back to step 3.

Example. We present a simple trace of the algorithm above for the east-west train
problem. Consider that we have already presented two examples. Therefore, there
are already two clusters in each module: clusters 1 and 2 in ART$_a$ associated with clusters
1 and 2 in ART$_b$, respectively. The $F_2$ layer of each ART module is shown in Table 1.

Table 1. $F_2^a$ and $F_2^b$ layers

$F_2^a$ layer (ART$_a$)
  Cluster 1   $W_1$ = {has_car(X,V), open(V), long(V), shape(V, rectangle),
                     has_car(X,W), not long(W), not open(W), shape(W, rectangle),
                     has_car(X,Z), not long(Z), open(Z), shape(Z, rectangle)}
  Cluster 2   $W_2$ = {has_car(A,B), not open(B), long(B), shape(B, rectangle),
                     has_car(A,C), not long(C), open(C), shape(C, rectangle)}

$F_2^b$ layer (ART$_b$)
  Cluster 1   $W_1$ = {eastbound(X)}
  Cluster 2   $W_2$ = {not eastbound(A)}

Let $\bar{\rho}_a = 0$, $\alpha_a = \alpha_b = 0.001$ and $\rho_{ab} = \rho_b = 1$. Now consider the presentation of
the following example e = eastbound(east3) in step 1. The bottom-clause $\bot_3$ generated
is:

$\bot_3$ = {eastbound(E) ← has_car(E,F), not long(F), open(F), shape(F, ushaped),
      has_car(E,G), open(G), not long(G), shape(G, bucket),
      has_car(E,H), not open(H), not long(H), shape(H, rectangle)}

In step 2, $X^a$ is the body of $\bot_3$. In step 3, we calculate the choice and match functions
for each node.

For cluster 1, the relational intersection is

$INT_{WX}$ = {(has_car(X,V), has_car(E,F)), (open(V), open(F)),
          (has_car(X,Z), has_car(E,G)), (open(Z), open(G)),
          (not long(Z), not long(G)), (has_car(X,W), has_car(E,H)),
          (not open(W), not open(H)), (not long(W), not long(H)),
          (shape(W, rectangle), shape(H, rectangle))}

using substitutions $\theta_w$ = {E/X, H/W, G/Z, F/V} and $\theta_x = \emptyset$. Since Fuzzy(w) = Fuzzy(x) =
1 for each $(w, x) \in INT_{WX}$, then $T_1^a(X^a) = 9/(12 + \alpha) \approx 0.75$ and $m_1^a(X^a)$ =
9/12 = 0.75.

For cluster 2, the relational intersection is

$INT_{WX}$ = {(has_car(A,C), has_car(E,G)), (open(C), open(G)),
          (not long(C), not long(G)), (has_car(A,B), has_car(E,H)),
          (not open(B), not open(H)), (shape(B, rectangle), shape(H, rectangle))}

using substitutions $\theta_w$ = {E/A, H/B, G/C} and $\theta_x = \emptyset$.

Then $T_2^a(X^a) = 6/(8 + \alpha) \approx 0.75$ and $m_2^a(X^a)$ = 6/12 = 0.5.

Therefore, node 1 is chosen and passes the vigilance criterion (step 3.2). In step 4b,
we calculate the choice function in $F_2^b$: $X^{ab} = w_1^{ab} = (1\ 0)$, $T_1^b = 1$ and $T_2^b = 0$. Then node
1 in $F_2^b$ is chosen and $X^b = W_1^b \theta_w^{-1} \theta_x$ = {eastbound(E)}. $X^b$ holds a target atom, so we
go to step 5b, and calculate

$INT_{WX}$ = {(eastbound(X), eastbound(E))}

using substitutions $\theta_w$ = {E/X, H/W, G/Z, F/V} and $\theta_x = \emptyset$. Then, $m_1^b(X^b)$ = 1/1 = 1.

In step 6a, the nodes 1 in each module are updated using eqs. 4 and 5. The final state
of the $F_2^a$ and $F_2^b$ layers is shown in Table 2.

Table 2. Final state of the $F_2^a$ and $F_2^b$ layers

$F_2^a$ layer (ART$_a$)
  Cluster 1   $W_1$ = {has_car(M,N), open(N),
                     has_car(M,O), not long(O), not open(O), shape(O, rectangle),
                     has_car(M,P), not long(P), open(P)}
  Cluster 2   $W_2$ = {has_car(A,B), not open(B), long(B), shape(B, rectangle),
                     has_car(A,C), not long(C), open(C), shape(C, rectangle)}

$F_2^b$ layer (ART$_b$)
  Cluster 1   $W_1$ = {eastbound(M)}
  Cluster 2   $W_2$ = {not eastbound(A)}

3.5 Performance Evaluation

To classify a test example, we only execute steps 1, 2, 3 (in case it needs to go to 4a, we
stop and return 'no'), 4b, 5a, 5b (here, if $m_K^b(X^b) \geq \rho_b$, then we return 'yes', else 'no')
of the algorithm.

4 Preliminary Experiments

The system has been tested on some machine learning domain problems. The first three
problems are defined in [10]: learning the concept of an arch, family relationships, and
the Eleusis problem (layouts 1, 2 and 3). The fourth problem is the east-west train
problem with 10 examples [12]. They do not have a test set and, for each problem, the
theory extracted from FOCA is equal to the known correct theory.

The Illegal KRK endgames problem [10] has a test set. We use the same experimental
methodology for this purpose. The rules extracted from FOCA yielded 97.92% accuracy,
with a standard deviation of 0.005%. Other ILP systems' performances are presented for
comparison in Table 3.

Table 3. Results on the KRK problem

SYSTEMS            ACC. (5 Sets of 100 Exs.)
FOIL               90.8%  sd 1.7%
LINUS-ASSISTANT    98.1%  sd 1.1%
LINUS-NEWGEM       88.4%  sd 4.0%
FOCA               97.92% sd 0.005%



5 Conclusions 

This work contributes to bridging the gap between symbolic and connectionist learning
systems. A first-order extension of the neural network Cascade ARTMAP, the FOCA
system, was presented, by extending to first-order the structure of the network, the choice
and match functions, and the learning algorithm. Since (Fuzzy) ARTMAP is composed
of two (Fuzzy) ART modules, this work can also be seen as presenting a first-order
(Fuzzy) ART, a first-order clustering system.

The preliminary experimental results show that FOCA can learn first-order theories
as an ILP system. Nevertheless, they have not yet explored FOCA's main advantage
as a first-order theory refinement system. We are now in the process of applying FOCA
to such problems and other real-world ILP applications: mutagenicity, drug design,
language learning, and character recognition [24]. This could not yet be done because
the relational intersection defined here is intractable in the general case. We plan to enhance
FOCA with stochastic matching [19] to solve this problem.

Like the ART family, FOCA is also sensitive to the order of example presentation.
We can overcome this problem by using an ensemble, as in [4].

Abstract. Nowadays, propositionalization is an important method that
aims at reducing the complexity of Inductive Logic Programming, by
transforming a learning problem expressed in a first order formalism
into an attribute-value representation. This implies a two-step process,
namely finding an interesting pattern and then learning relevant
constraints for this pattern. This paper describes a novel genetic approach
for handling the second task.

The main idea of our approach is to consider the set of variables
appearing in the pattern, and to learn a partition of this set. Numeric
constraints are directly put on the equivalence classes involved by the
partition rather than on variables. We have proposed an encoding for
representing a partition by an individual, and general set-based
operators to alter one partition or to mix two partitions. For propositionalization,
operators are extended to change not only the partition but also the
associated numeric constraints.



1 Introduction 

Propositionalization (LINUS [9], STILL, REPART [17], [1], ...) makes it possible to 
restrict the search space of traditional Inductive Logic Programming (ILP) systems, 
by limiting the number of relations that could appear in a rule to those 
defined in a pattern. This pattern has to be carefully chosen to allow the discovery 
of an interesting rule; it can be learned or given by the user. The system 
has to specialize it by learning either symbolic constraints or numeric ones on 
its variables. Solving an ILP problem by propositionalization thus implies a two-step 
process, namely finding an interesting pattern and then learning relevant 
constraints for this pattern. We are interested in the second step, and we propose 
an approach based on a Genetic Algorithm (GA) to carry it out. Our GA performs 
generalization and specialization, but also operations that make it possible to explore 
new points of the search space. This avoids the drawback of some deterministic methods 
that can explore unfruitful paths without backtracking. 

At first, we concentrated our work on learning only equality relations 
between variables occurring in the pattern [1]. The idea is to consider the set of 
variables and to learn a partition of this set defining which variables are equal. 
For this purpose, we have proposed an encoding for representing a partition 
of a set of variables by an individual, and several set-based operators to alter 
a partition or to mix two partitions. The resulting GA goes beyond the scope of 
propositionalization: it makes it possible to learn genetically an optimal partition of a set 
of objects, and can be applied to other problems that can be reformulated as 
learning a partition. 

In this paper, the approach is extended to achieve other propositionalization 
requirements, namely learning numeric constraints on variables at the same time 
as equality relations between them. Numeric constraints are thus directly put 
on equivalence classes involving numeric variables, and the genetic operators are 
extended to change not only the partition but also the associated constraints. 

This paper is organized as follows. Section 2 is dedicated to the presentation 
of our approach, in particular the encoding we propose to represent a partition 
of variables by an individual, and a set of operators to generate new individuals. 
Then Sect. 3 gives some interesting properties of these operations for learning 
a rule. Sect. 4 presents preliminary experiments. In Sect. 5, related work 
is presented, and finally conclusions are given in Sect. 6. 

2 From Learning Rules to Learning Partitions 
with Constraints 

Let us recall that in propositionalization, a pattern expressing the general form 
of the rule we are looking for is given. 

Let us now recall that GAs are stochastic optimization algorithms that evolve 
a population of individuals representing potential solutions for the problem 
at hand. This is done by selecting good individuals in the current population 
and applying genetic operators to them. The basic algorithm works as follows: 

initialize the population 
do 
    evaluate population 
    select good individuals 
    apply genetic operators 
until end criterion 

This requires finding a suitable encoding to represent a potential solution as an 
individual in a way that enables potentially good solutions to be crossed. 
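
The basic loop above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the truncation selection, the fixed number of generations used as end criterion, and the names `run_ga`, `fitness` and `operators` are all assumptions made for the example.

```python
import random


def run_ga(init_population, fitness, operators, generations=50, seed=0):
    """Generic GA loop: evaluate, select, apply operators, repeat.

    `operators` is a list of functions op(individual, rng) -> new individual.
    Selection scheme and stopping rule are assumptions of this sketch.
    """
    rng = random.Random(seed)
    population = list(init_population)
    for _ in range(generations):  # end criterion: fixed number of generations
        # evaluate population and keep the better half (truncation selection)
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: max(2, len(scored) // 2)]
        # apply randomly chosen genetic operators to randomly chosen parents
        children = []
        while len(parents) + len(children) < len(population):
            op = rng.choice(operators)
            children.append(op(rng.choice(parents), rng))
        population = parents + children
    return max(population, key=fitness)


# Toy usage: individuals are integers, the optimum is 10.
best = run_ga(
    [0, 1, 2, 3],
    fitness=lambda x: -abs(x - 10),
    operators=[lambda x, rng: x + rng.choice([-1, 1])],
)
```

Because the selected parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next in this sketch.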



2.1 The Pattern 

In our work, the pattern is a Horn clause, in which all occurrences of variables 
are distinct (it is thus the most general clause based on the given relations), 
except when some equality relations are fixed at the beginning. For the time 
being, our approach makes it possible to learn equality relations between variables 
occurring in the pattern, and minimum and maximum thresholds on the values of the 
numeric ones. The equality relations induce a partition of the set of variables 
into equivalence classes. An individual encodes such a partition, and constraints 
on numeric variables are directly put on the equivalence class C to which they 
belong. 

A class of numeric variables may or may not have constraints on its minimum and 
maximum values. 

2.2 Individuals Encoding 

Let us denote by $S = \{X_1, \ldots, X_n\}$ the set of variables occurring in the pattern. 

As already stated, an individual, in our approach, encodes a partition of S, 
together with the constraints put on numeric classes. 

Each class is identified by a label. Our encoding represents an individual as an 
array Ind indexed from 1 to n, where n is the cardinality of S. For $i \in \{1, \ldots, n\}$, 
Ind[i] is associated with the variable $X_i$ of S and gives information about the class 
this variable belongs to in the partition (label, minimum value, maximum value). 
For labelling a class, we consider the variables it contains and take the lowest 
index among them. This rule ensures the uniqueness of the representation of a 
partition by an individual. Our search space is thus restricted to valid and 
non-redundant potential solutions. 

Figure 1 shows an example of the representation of a rule by an individual, 
based on a given pattern. In this example, R and S are two relations of arity 2; 
$X_4$ and $X_6$ are numeric variables of the same type, while the others are symbolic 
ones with the same type. 

Note that encoding partitions as strings of group numbers has already 
been done in the field of clustering, where it is known as group-number encoding. 
Nevertheless, as far as we know, none of the works in this field imposes that 
a group number is the lowest index among the objects it contains. 
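
As an illustration of this encoding, the following Python sketch builds an individual from a partition, labelling each class by the lowest index it contains. The function names and the representation of a gene as a `(label, min, max)` tuple are assumptions made for the example; the partition used is the one implied by the potential solution of Fig. 1, where $X_3 = X_2$, $X_5 = X_1$, $X_6 = X_4$ and $X_4 > 12$.

```python
INF = float("inf")


def encode(classes, bounds=None):
    """Encode a partition of {1..n} as a list of (label, min, max) genes.

    classes: iterable of sets of 1-based variable indices (a partition).
    bounds:  optional dict mapping a class label to its (min, max) bounds;
             classes without an entry are unconstrained.
    Gene i-1 of the result describes the class of variable X_i.
    """
    classes = [set(c) for c in classes]
    n = sum(len(c) for c in classes)
    bounds = bounds or {}
    individual = [None] * n
    for c in classes:
        label = min(c)  # a class is labelled by its lowest variable index
        lo, hi = bounds.get(label, (-INF, INF))
        for i in c:
            individual[i - 1] = (label, lo, hi)
    return individual


def labels(individual):
    """Return only the class labels of an individual."""
    return [gene[0] for gene in individual]


# Partition implied by the potential solution of Fig. 1.
ind = encode([{1, 5}, {2, 3}, {4, 6}], bounds={4: (12, INF)})
print(labels(ind))
```

Since the label is determined by the content of the class, listing the classes in a different order yields the same individual, which is the uniqueness property stated above.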



pattern: $R(X_1, X_2) \wedge S(X_3, X_4) \wedge S(X_5, X_6)$ 

coding pattern: $X_1 X_2 X_3 X_4 X_5 X_6$ 

a potential solution based on the pattern: $R(X_1, X_2) \wedge S(X_2, X_4) \wedge S(X_1, X_4), X_4 > 12$ 

[Figure: the corresponding individual (indexed 1 to 6) and its equivalence classes] 

Fig. 1. An example of encoding 

2.3 Operations on Equivalence Classes and Numeric Constraints 

Let us now describe the genetic operations we have defined. For each operation, 
we describe its effects on the partitions coded by the individuals concerned. A 
more detailed description of how it is achieved in the more general context of 
learning a partition of a set of objects is given in [4]. 

We focus in this paper on the application of our approach to propositionalization, 
and on its extension to deal with numeric constraints. 

Let us note that 

— variables are typed, 

— all the variables of an equivalence class must have the same type, 

— a class has the type of its variables. 

Two classes involved in an operation must have the same type. Moreover, 
in case of classes with numeric variables, numeric operations are applied to put 
appropriate constraints on them. In most cases, several treatments are possible 
to modify constraints on the numeric classes concerned by a genetic operation, 
or to set some on the new ones. In the next section, we will study the effects 
of these operations in terms of generalization or specialization of the underlying 
rule. 

For clarity, we only consider numeric variables with the same type 
in the illustrating examples, so that all classes can be mixed. 



Operations Involving One Individual. 

Let us denote by Ind the individual concerned, by Ind' its child, by C the class of 
Ind on which the operation is performed, and in case of a numeric class by $min_C$ 
and $max_C$ its bounds. 

On the figures, we denote by $C_l$ a class labelled by $l$ in the parent and by 
$C'_l$ a class labelled by $l$ in its child. 

Let us notice that when there is no lower bound $min_C = -\infty$ (or 0 for 
positive values), and when there is no upper bound $max_C = +\infty$. 

Isolating a Variable. A variable V is selected, $V \in C$. V is removed from C 
and put alone in a new class C' in Ind', while the rest of the partition is kept 
unchanged. 

For this operation, if V is a numeric variable, two possibilities can be considered 
to put constraints on C': 

— constraints on C are propagated to C' ($min_{C'} = min_C$ and $max_{C'} = max_C$), 

— constraints on C' are fixed at random. 

In Fig. 2, $V = X_3$ is removed from the class $C_3$ of Ind. This leads to two 
classes $C'_3$ and $C'_4$ in Ind', with $C'_3$ containing only $X_3$ and $C'_4$ the other variables 
of $C_3$. In this example, the constraint on $C'_3$ is obtained by propagating that of 
$C_3$ (treatment 1). 
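
A minimal sketch of this operation with treatment 1 is given below. It is an illustration under assumptions, not the authors' code: an individual is represented here as a list of class labels plus a separate dictionary of bounds per label, and the function name `isolate` is hypothetical. The usage example reproduces the label arrays of Fig. 2.

```python
def isolate(labels, bounds, v):
    """Isolate variable X_v (1-based) in its own class, propagating bounds.

    labels: list where labels[i-1] is the class label of X_i.
    bounds: dict mapping a class label to (min, max); absent = unconstrained.
    Returns the child's (labels, bounds); the parent is left untouched.
    """
    old = labels[v - 1]
    members = [i for i, l in enumerate(labels, start=1) if l == old]
    if members == [v]:
        return list(labels), dict(bounds)  # V is already alone: nothing to do
    rest = [i for i in members if i != v]
    new_labels = list(labels)
    new_labels[v - 1] = v                # a singleton class is labelled by V
    rest_label = min(rest)               # remaining class: lowest index left
    for i in rest:
        new_labels[i - 1] = rest_label
    new_bounds = {l: b for l, b in bounds.items() if l != old}
    if old in bounds:                    # treatment 1: propagate the bounds
        new_bounds[v] = bounds[old]
        new_bounds[rest_label] = bounds[old]
    return new_labels, new_bounds


# Fig. 2: X3 is isolated from the class labelled 3.
child_labels, child_bounds = isolate(
    [1, 2, 3, 3, 3, 2], {3: (float("-inf"), 16)}, 3
)
print(child_labels, child_bounds)
```

Both classes are relabelled by their lowest index, so the child remains a valid individual in the sense of Sect. 2.2.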



A Genetic Algorithm for Propositionalization 






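[Figure: parent individual 1 2 3 3 3 2 with classes $C_1$, $C_2 = \{X_2, X_6\}$ and $C_3 = \{X_3, X_4, X_5\}$; child individual 1 2 3 4 4 2 with classes $C'_1$, $C'_2 = \{X_2, X_6\}$, $C'_3 = \{X_3\}$ and $C'_4 = \{X_4, X_5\}$; the bound < 16 of $C_3$ is propagated to $C'_3$ and $C'_4$] 

Fig. 2. Isolating a variable 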



Moving a Variable across Classes. A variable V and an equivalence class C 
containing variables of the same type as V are selected in Ind. V is added to C 
in Ind' , and the rest of the partition is kept unchanged. 

C keeps its numeric constraints. 

Figure 3 gives an example where $X_2$ is added to the class labelled 3. 

Fig. 3. Moving a variable across classes 



Splitting a Class into Two. A class C is selected in Ind. Ind' has the same 
equivalence classes as Ind, except that C is divided into two classes, $C'_{split_1}$ and 
$C'_{split_2}$, which contain the variables of C, randomly distributed between them. 
Two cases are possible to treat numeric constraints: 

— constraints put on C are propagated to $C'_{split_1}$ and $C'_{split_2}$, 

— $C'_{split_1}$ has the same constraints as C, and those of $C'_{split_2}$ are fixed at random. 

In the example given in Fig. 4, the class $C_2$ is selected in Ind and is split into 
$C'_2$ and $C'_3$ in Ind'. Constraints on $C'_2$ are those of $C_2$, while constraints on $C'_3$ are 
fixed at random (treatment 2). 

Let us note that isolating a variable is a special case of splitting a class. 
Nevertheless, it often breaks fewer equality relations in C, so that it can be 
interesting when few changes are necessary in the partition to achieve a better 
solution. 

Fig. 4. Splitting a class 



Merging Two Classes. Two classes $C_i$ and $C_j$ ($i \neq j$) with the same type are 
selected in Ind; Ind' encodes the same partition as Ind, except that the variables 
of $C_i$ and $C_j$ are gathered into a single class $C'_{merge}$. 

In case of numeric variables, three cases can be considered to put numeric constraints 
on $C'_{merge}$: 

— $min_{C'_{merge}} = \min(min_{C_i}, min_{C_j})$ and $max_{C'_{merge}} = \max(max_{C_i}, max_{C_j})$, 

— $min_{C'_{merge}} = \max(min_{C_i}, min_{C_j})$ and $max_{C'_{merge}} = \min(max_{C_i}, max_{C_j})$, 

— $min_{C'_{merge}} = min_{C_i}$ and $max_{C'_{merge}} = max_{C_i}$. 
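
The first two treatments of the bounds can be illustrated as follows. This is a sketch under assumptions: the list-of-labels representation, the function names and the numeric values in the usage example are made up for illustration, and only treatments 1 and 2 are covered.

```python
INF = float("inf")


def merge_bounds(b_i, b_j, treatment):
    """Combine the (min, max) bounds of two classes that are being merged."""
    (lo_i, hi_i), (lo_j, hi_j) = b_i, b_j
    if treatment == 1:   # widest interval: covers both original intervals
        return (min(lo_i, lo_j), max(hi_i, hi_j))
    if treatment == 2:   # narrowest interval: intersection of the two
        return (max(lo_i, lo_j), min(hi_i, hi_j))
    raise ValueError("only treatments 1 and 2 are sketched here")


def merge_classes(labels, bounds, l_i, l_j, treatment):
    """Merge the classes labelled l_i and l_j; returns the child's encoding."""
    new_label = min(l_i, l_j)            # lowest index of the merged class
    new_labels = [new_label if l in (l_i, l_j) else l for l in labels]
    new_bounds = {l: b for l, b in bounds.items() if l not in (l_i, l_j)}
    new_bounds[new_label] = merge_bounds(
        bounds.get(l_i, (-INF, INF)), bounds.get(l_j, (-INF, INF)), treatment
    )
    return new_labels, new_bounds


# Hypothetical bounds, chosen only for the illustration.
parent = [1, 2, 3, 3, 3, 2]
parent_bounds = {2: (1, 20), 3: (5, 16)}
print(merge_classes(parent, parent_bounds, 2, 3, treatment=1))
print(merge_classes(parent, parent_bounds, 2, 3, treatment=2))
```

Note that with treatment 2 the resulting interval is empty whenever the two original intervals are disjoint; the sketch does not try to detect this case.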

In Fig. 5, the two classes $C_2$ and $C_3$ of Ind are merged into $C'_2$ in Ind'. The 
treatment on constraints corresponds to the first one described. 

Operations Involving Two Individuals. 



Let us denote by $Ind_1$ and $Ind_2$ the individuals concerned, by $Ind'_1$ and $Ind'_2$ 
their children. 

In the figures, we denote by $C_{il}$ a class labelled by $l$ in Parent $i$ and by $C'_{il}$ a 
class labelled by $l$ in Child $i$. 



Union. A class $C_{1i}$ is selected in $Ind_1$, and another class $C_{2j}$ in $Ind_2$ such that 
$C_{1i}$ and $C_{2j}$ have a common variable. After the operation, $Ind'_1$ and $Ind'_2$ encode 
the same partition as $Ind_1$ and $Ind_2$ respectively, except that variables of $C_{1i}$ 
and $C_{2j}$ are gathered in a unique class $C_{union}$. Numeric constraints on $C_{union}$ are 
determined as in the case of merging two classes. 

In Fig. 6, class $C_{12}$ is selected in $Ind_1$; only two classes have a common 
variable with $C_{12}$ in $Ind_2$, namely $C_{22}$ and $C_{24}$, and we choose $C_{24}$. The resulting 
children encode the same partitions as their parents, except that variables of $C_{12}$ 
and $C_{22}$ are gathered in a single class. In this example, treatment on numeric 
constraints is the first one described. 

Fig. 6. Union 



Intersection. A class $C_{1i}$ is selected in $Ind_1$, and another class $C_{2j}$ in $Ind_2$ such 
that $C_{1i}$ and $C_{2j}$ have a common variable. After the operation, $Ind'_1$ and $Ind'_2$ 
encode the same partition as $Ind_1$ and $Ind_2$ respectively, except that variables 
common to $C_{1i}$ and $C_{2j}$ are gathered into a class $C_{inter}$. Numeric constraints on 
$C_{inter}$ are determined according to one of the cases described for the operation of 
merging two classes. 

Figure 7 gives an example where class $C_{12}$ is selected in $Ind_1$; only two classes 
have a common variable with $C_{12}$ in $Ind_2$, namely $C_{22}$ and $C_{23}$, and we choose 
$C_{22}$. The resulting children encode the same partitions as their parents, except 
that the variables common to $C_{12}$ and $C_{22}$ are gathered in a unique class. For this example, 
treatment on numeric constraints is the second described. 

For these two operators, we require a common variable between the two 
classes involved. As $C_{1i}$ and $C_{2j}$ belong to individuals selected for their quality, 
they induce equality relations potentially good for the problem, and we hope to 
gather interesting links connected by this variable. In the case of intersection 
this requirement is necessary to have an effect. In the case of union this avoids 
too many side effects. For example, if we had chosen classes $C_{11}$ and $C_{24}$ 
in Fig. 6, every class would have been disturbed, with nothing indicating in $Ind_1$ 
or $Ind_2$ that variables $X_1, X_6$ and $X_4, X_5$ could be advantageously gathered. 
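
The effect of these two operators on the partitions can be sketched with plain Python sets. This is one possible reading of the definitions above, given as an assumption: partitions are lists of sets of variable indices, numeric constraints are ignored, and the function names and example partitions are hypothetical.

```python
def union_child(partition, c_own, c_other):
    """Child partition after union: variables of both classes are gathered.

    partition: list of sets (the parent's classes); c_own is one of them,
    c_other is the class selected in the other parent.
    """
    gathered = set(c_own) | set(c_other)
    # variables moved into the gathered class leave their former classes
    rest = [set(c) - gathered for c in partition]
    return sorted([gathered] + [c for c in rest if c], key=min)


def intersection_child(partition, c_own, c_other):
    """Child partition after intersection: c_own is split into the common
    variables and the remaining ones; other classes are unchanged."""
    common = set(c_own) & set(c_other)
    child = []
    for c in partition:
        c = set(c)
        if c == set(c_own):
            child += [s for s in (common, c - common) if s]
        else:
            child.append(c)
    return sorted(child, key=min)


# Hypothetical parents sharing variable 4 between the selected classes.
ind1 = [{1}, {2, 3, 4}, {5, 6}]
ind2 = [{1, 2}, {3}, {4, 5}, {6}]
c1, c2 = {2, 3, 4}, {4, 5}
print(union_child(ind1, c1, c2), union_child(ind2, c2, c1))
print(intersection_child(ind1, c1, c2), intersection_child(ind2, c2, c1))
```

Each child is obtained by calling the same function on its own parent, with the roles of the two selected classes exchanged.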

Operations on Constraints. 

These operations involve one individual and consist in altering constraints put on 
a given numeric class, to reinforce or relax them. For each of them, the minimum 
and maximum values can be altered or not depending on probabilities. 

Reinforcing Constraints on a Class. If altered, the minimum (resp. maximum) 
takes a value greater (resp. lower) than the previous one. 

Relaxing Constraints on a Class. If altered, the minimum (resp. maximum) takes 
a value lower (resp. greater) than the previous one. 
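
These two operations can be sketched as a single function. The probabilities `p_min` and `p_max`, the maximal step `max_step` and the way the new value is drawn are assumptions of this sketch; the text only states the direction in which each bound moves. Finite bounds are assumed.

```python
import random


def alter_bounds(bounds, mode, rng, p_min=0.5, p_max=0.5, max_step=5.0):
    """Reinforce or relax the (min, max) bounds of a numeric class.

    mode: "reinforce" raises the minimum and lowers the maximum,
          "relax" does the opposite. Each bound is altered independently,
          with probability p_min (resp. p_max).
    """
    lo, hi = bounds
    sign = 1.0 if mode == "reinforce" else -1.0
    if rng.random() < p_min:
        # 1 - random() lies in (0, 1], so the step is strictly positive
        lo = lo + sign * max_step * (1.0 - rng.random())
    if rng.random() < p_max:
        hi = hi - sign * max_step * (1.0 - rng.random())
    return (lo, hi)


rng = random.Random(0)
print(alter_bounds((5.0, 16.0), "reinforce", rng, p_min=1.0, p_max=1.0))
print(alter_bounds((5.0, 16.0), "relax", rng, p_min=1.0, p_max=1.0))
```

Reinforcing can produce an empty interval (minimum above maximum); a complete system would have to decide how to handle such individuals, which this sketch does not do.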

3 Operators Properties 

In Inductive Logic Programming the most common definition of generalization 
between clauses is θ-subsumption, defined as follows: 

Definition 1 (θ-subsumption between clauses). Let $C : h_C \leftarrow B_C$ and 
$D : h_D \leftarrow B_D$ be two clauses. C θ-subsumes D (or C is more general than D 
by θ-subsumption), written $C \succeq D$, when there exists a substitution $\theta$ such that 
$\theta(h_C) = h_D$ and $\theta(B_C) \subseteq B_D$. 

Nevertheless, this definition must be extended to handle numeric data: with 
θ-subsumption two relations X = 3 and X = 5 are generalized into X = Y, 
whereas the expected relation would be 3 < X < 5. To deal with this, several 
works [14, 2, 10] have suggested modelling learning with symbolic and numeric 
data in the framework of Constraint Logic Programming, and have proposed 
a new definition of generalization between constrained clauses [14, 2]. 

Definition 2 (cθ-subsumption between constrained clauses). Let C and 
D be two constrained clauses $h_C \leftarrow c_C, B_C$ and $h_D \leftarrow c_D, B_D$, where $c_C$, $c_D$ 
are conjunctions of constraints, $h_C$, $h_D$ are atoms and $B_C$, $B_D$ are conjunctions 
of atoms. The clause C cθ-subsumes D, written $C \succeq_c D$, iff there exists a 
substitution $\theta$ such that $\models_c \forall (c_D \rightarrow c_C\theta)$, $\theta(h_C) = h_D$ and $\theta(B_C) \subseteq B_D$. 

In this paper, we are mainly interested in numeric constraints, expressed by 
X comp a, with $comp \in \{<, >\}$. 

We call operator the combination of an operation on equivalence classes and 
a corresponding operation on the constraints associated with the numeric classes 
possibly involved, or an operation on constraints only. Some of these operators 
specialize or generalize a rule under cθ-subsumption, while the others 
do not have a determined effect but make it possible to explore the search space, being 
partly guided by the information contained in the parents. 

Let us denote by $\mathcal{V}(c_i)$ the set of variables occurring in a clause $c_i$, c the clause 
coded by Ind and c' the clause coded by Ind'. Let us notice that the variables 
of Ind are those of the pattern, whereas $\mathcal{V}(c)$ is composed of representatives of 
the equivalence classes coded in Ind. 

Proposition 1. Isolating a variable together with treatment 1 on numeric con- 
straints generalizes a rule. 

Proof. Let us denote by V the isolated variable of Ind with $V \in C$. If V is not 
a numeric variable: 

— If $C = \{V\}$: nothing is done and Ind' = Ind. 

— If $C \neq \{V\}$: 

• if $V \in \mathcal{V}(c)$ (V is the representative of class C in c), there exists U such 
that $U \in C$, $U \in \mathcal{V}(c')$ and $U \neq V$. More precisely, V occurs in c' as the 
representative of the new class, and a new variable $U \in C$ represents the 
class $C - \{V\}$. 

Denoting $\sigma = \{U/V\}$, we obtain $c'\sigma = c$. 

• if $V \notin \mathcal{V}(c)$, there exists U such that $U \in C$, $U \in \mathcal{V}(c)$ and $U \neq V$. 
Denoting $\sigma = \{V/U\}$, we obtain $c'\sigma = c$. 

So c' θ-subsumes c. 

If V is a numeric variable, the same applies but the constraint a < V < b in 
c is transformed into $a < V < b \wedge a < U < b$ in c'. Applying the substitution $\sigma$ 
to $(a < V < b \wedge a < U < b)$ leads to $a < V < b \wedge a < V < b$, which is equivalent 
to a < V < b. Therefore $\models c \rightarrow c'\theta$. So c' cθ-subsumes c. 



Proposition 2. Splitting a class together with treatment 1 on numeric con- 
straints generalizes a rule. 

Proof. If the class does not contain numeric variables: 

— If $C'_{split_1} = \{\}$ or $C'_{split_2} = \{\}$: nothing is done and Ind' = Ind. 

— If $C'_{split_1} \neq \{\}$ and $C'_{split_2} \neq \{\}$: there exist U and V such that $U, V \in C$, 
$U \in \mathcal{V}(c)$, $V \notin \mathcal{V}(c)$ and $U, V \in \mathcal{V}(c')$ (U is the representative of class C that 
occurs in c, V does not occur in c; U is the representative of class $C'_{split_1}$ in 
c', V is the representative of class $C'_{split_2}$ in c'). 

Denoting $\sigma = \{V/U\}$, we obtain $c'\sigma = c$. 

So c' θ-subsumes c. 

If the class contains numeric variables, the proof is similar to that of Prop. 
1. So c' cθ-subsumes c. 

Proposition 3. Merging two classes together with treatment 2 on numeric con- 
straints specializes a rule. 

Proof. Let us assume that the representative that is kept is U and thus belongs 
to $C_1$. 

There exist U and V such that $U \in C_1$, $U \in \mathcal{V}(c)$ and $U \in \mathcal{V}(c')$, $V \in C_2$, 
$V \in \mathcal{V}(c)$ and $V \notin \mathcal{V}(c')$. 

If U and V are not numeric variables: denoting $\sigma = \{V/U\}$, we obtain $c\sigma = c'$. 
So c θ-subsumes c'. 

If U and V are numeric variables with constraints a < U < b and d < V < e 
in c, then the new constraints on U in c' are max(a, d) < U < min(b, e). Applying 
the substitution $\sigma$ to a < U < b and d < V < e leads to a < U < b and d < U < e, 
and max(a, d) < U < min(b, e) implies a < U < b and d < U < e. Therefore 
$\models c' \rightarrow c\theta$. So c cθ-subsumes c'. 

Proposition 4. Intersection together with treatment 1 on numeric constraints 
generalizes a rule. 

Proof. Let us now denote by $c_1$ the clause encoded by $Ind_1$, $c_2$ the clause encoded 
by $Ind_2$, $c'_1$ the clause encoded by $Ind'_1$ and $c'_2$ the clause encoded by $Ind'_2$. 

Let U be the variable that occurs in $c_1$ and represents class $C_{1i}$. Class $C_{1i}$ is 
split into $C_{inter}$ and a class containing the other variables of $C_{1i}$. 

Let us denote by V and W the representatives of the two classes. We have 
either U = V or U = W. 

If V is not a numeric variable, the proof is the same as for the splitting case, 
and $c'_1$ θ-subsumes $c_1$. 

If V is a numeric variable: let us denote by U the representative of class $C_{1i}$ 
in $c_1$, by V the representative of class $C_{2j}$ in $c_2$ and by W the representative of 
class $C_{inter}$ in $c'_1$ (it may happen that U = V or U = W or V = W). 

If a < U < b (resp. c < V < d) is the constraint linked to U (resp. V) in $Ind_1$ 
(resp. $Ind_2$) then the constraint linked to W is min(a, c) < W < max(b, d). It is 
easy to show that substitution $\sigma$ satisfies $(a < U < b \wedge c < V < d) \vdash (\min(a, c) < 
W < \max(b, d))\sigma$. So $c'_1$ cθ-subsumes $c_1$. 

The proof is the same for $c'_2$ cθ-subsumes $c_2$. 



Proposition 5. Reinforcing constraints specializes a rule. 

Proposition 6. Relaxing constraints generalizes a rule. 

For the last two propositions, proofs are similar to the previous ones. 

4 Preliminary Experiments 

We have implemented this approach in C++, and used the Database Management 
System PostgreSQL. Tests have been performed on two classical datasets, 
namely a family dataset with grand-father as the target concept and Michalski's 
10 trains dataset. 

For the time being, we are only searching for a conjunctive description of the 
concept, that covers at least one example and rejects many counter-examples. 
Our fitness function thus gives a null quality to an individual that does not 
cover any example. If any is covered, it gives more weight to the rejection of 
counter-examples than to the coverage of examples, so that we favour consistent 
solutions. The probabilities used for the operators in our experiments are the 
following: 

— union: 0.3, 

— intersection: 0.3, 

— isolating a variable: 0.05, 

— moving a variable across classes: 0.05, 

— splitting a class: 0.1, 

— merging two classes: 0.1. 

During the initialization step, for each individual, all variables are considered 
in the order they appear, and a parameter gives the probability that a variable is put 
alone in a new class rather than in an existing class (if no class exists yet, a new 
class is created anyway). In our experiments, we create individuals with a large number 
of classes, and this parameter is set to 0.9. 
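
A possible reading of this initialization step is sketched below. It is an assumption-laden illustration: the parameter is interpreted as the probability of creating a new class, variable types are ignored, an existing class is chosen uniformly at random, and the name `random_individual` is hypothetical.

```python
import random


def random_individual(n, p_alone, rng):
    """Build the label array of a random individual over variables X_1..X_n.

    Variables are visited in order; each one starts a new class with
    probability p_alone (or when no class exists yet), otherwise it joins
    an existing class chosen uniformly at random.
    """
    labels = []
    for i in range(1, n + 1):
        existing = sorted(set(labels))
        if not existing or rng.random() < p_alone:
            labels.append(i)            # new class, labelled by its first index
        else:
            labels.append(rng.choice(existing))
    return labels


rng = random.Random(1)
print(random_individual(8, 0.9, rng))
```

Because variables are visited in increasing order, the first variable of a class is also its lowest index, so the labels produced respect the encoding of Sect. 2.2.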

Results are encouraging. For the family example, our database is composed 
of 4 tables: one for the relation father, one for the relation mother, one for storing 
the examples and one for storing the counter-examples. For instance, we pro- 
vide our program with 4 relations: grandfather, mother, father and mother, and 
set the population size to 20 individuals. Consistent definitions for grandfather 
rapidly emerge, as for instance the solution 01210267 which corresponds to 
one of the possible definitions of a maternal grandfather. Providing it with 8 
relations, with a population of 30 individuals, we also obtain the results very 
quickly. 
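
To make the reported solution easier to read, the following sketch decodes an individual back into a clause. It rests on several assumptions: the pattern literals are taken in the order grandfather, mother, father, mother, all relations are binary, the first literal is the head, each digit of the solution is the class label of one argument, and labels start at 0 in this printed solution. The function name `decode` is hypothetical.

```python
def decode(individual, pattern):
    """Turn a string of single-digit class labels into a readable clause.

    pattern: list of (relation name, arity); the first literal is the head.
    Each argument is named after the label of its equivalence class.
    """
    labels = [int(ch) for ch in individual]
    literals, pos = [], 0
    for name, arity in pattern:
        args = ", ".join(f"X{l}" for l in labels[pos:pos + arity])
        literals.append(f"{name}({args})")
        pos += arity
    return f"{literals[0]} :- {', '.join(literals[1:])}"


pattern = [("grandfather", 2), ("mother", 2), ("father", 2), ("mother", 2)]
print(decode("01210267", pattern))
# grandfather(X0, X1) :- mother(X2, X1), father(X0, X2), mother(X6, X7)
```

Under these assumptions the clause reads: X0 is the father of X2, who is the mother of X1, which matches a maternal grandfather; the last literal is not linked to the others.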

For our tests on the 10 trains dataset, we have worked on a database com- 
posed of 25 relations. Indeed, we have converted relations such as Ihshape, which 
indicates the form of a given load, into 4 relations, one for each shape, because 
we do not learn constraints on symbolic classes at the moment. For instance, 
we have provided it with 14 relations (the target concept east and 13 others, all 
different), leading to individuals composed of 30 genes which correspond to the 
30 variables appearing in the pattern. With a population of 30 individuals, the 
system has converged rapidly towards, for example, the consistent description 
which states that a train goes east if it has a rectangular wagon whose position is 
equal to the number of wheels of a wagon. This description covers two examples 
and our algorithm has to be embedded in a complete system in order to learn a 
set of rules that covers all the examples. 

5 Related Works 

Our work can be compared to the approach presented in [16], since both works 
address the same task of searching the subsumption lattice by means of Genetic 
Algorithms. Nevertheless, we propose a more compact representation of the individuals: 
the links between N variables are represented in our approach by an 
individual of length N, whereas they are represented in [16] by the top triangle of a 
matrix. The operators we propose are based on set operations whereas in their 
framework they are based on binary matrix operations. 

Many works have dealt with applying GAs to Concept Learning in an attribute-value 
representation. Some attempts have been made in first order representations. 
Nevertheless, the problem is much more complicated than with an attribute-value 
representation. Indeed, when learning in an attribute-value context, the 
pattern on which a rule is based is known and can be represented by a fixed-length 
individual. When the description of the concept has to be expressed in 
first order logic, we cannot determine in advance the form of the optimal rule 
(which literals occur in it, which variables of these literals are equal, ...). In some 
genetic relational concept learners, a model for the rule is provided to the system: 
it is given by the user (REGAL, G-NET) or based on a seeding example 
(SIAO1 [3]). In REGAL, a pattern (called a template) is a 
conjunction of literals in which an argument is replaced by a domain. It fixes 
the links between the variables in the literals, and the system only learns the 
values allowed among those of the given domains. For example, the template 
$P(X, Y, [v_{p1}, \ldots, v_{pm}]) \wedge Q(Y, [v_{q1}, \ldots, v_{qn}])$ indicates that the second variable 
of P and the first variable of Q are linked, that the third argument of P can take 
the values $v_{p1}, \ldots, v_{pm}$ and the second argument of Q the values $v_{q1}, \ldots, v_{qn}$. 

Let us notice that by the use of a pattern, such works can be seen as based 
on propositionalization. 

In SIAO1, relations between variables are modified either by mutations when 
changing a gene to a variable (and replacing, with a given probability, the other 
occurrences of the past symbol by the new one in the individual), or when applying a 
one-point crossover (but its effect on the exploitation of relations between variables 
seems more limited) or a generalization operator. For example, the seeding 
example $P(a,b) \wedge Q(b)$ could be generalized by mutation into $P(a,Y) \wedge Q(Y)$ 
or $P(a,Y) \wedge Q(b)$. In this approach relations between variables evolve, but the 
operators realize only generalization. 

6 Conclusion 

In this paper, we have mainly focused on the subtask of learning symbolic and 
numeric constraints in a pattern by means of a GA. Such a work could then be 
embedded in a propositionalization framework. A possible way of achieving this 
is for instance to use a star approach, as follows: 

1. choose a seed example e 

2. define a starting pattern P that covers e 

3. refine P by learning symbolic links or numeric constraints 

4. if some examples are not covered, then iterate at Step 1 
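
The four steps above can be sketched as a simple covering loop. This is an illustrative sketch only: `start_pattern`, `refine` and `covers` are hypothetical callbacks standing for steps 2 and 3 and for the coverage test, and the toy usage replaces clauses by numeric intervals.

```python
def star_cover(positives, start_pattern, refine, covers):
    """Learn a set of rules that together cover all positive examples."""
    rules = []
    uncovered = list(positives)
    while uncovered:                      # step 4: iterate while needed
        seed = uncovered[0]               # step 1: choose a seed example
        pattern = start_pattern(seed)     # step 2: starting pattern covering it
        rule = refine(pattern)            # step 3: refinement (e.g. by the GA)
        rules.append(rule)
        # the seed is always dropped, so the loop terminates even if the
        # refined rule no longer covers it
        uncovered = [e for e in uncovered[1:] if not covers(rule, e)]
    return rules


# Toy usage: examples are numbers, a "rule" is an interval (lo, hi).
rules = star_cover(
    [1, 2, 3, 10, 11],
    start_pattern=lambda s: (s, s),
    refine=lambda p: (p[0] - 2, p[1] + 2),
    covers=lambda r, e: r[0] <= e <= r[1],
)
print(rules)
```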

Propositionalization is an interesting solution to the complexity problems linked 
to the size of the search space. Nevertheless, once a pattern is given, the database 
of positive and negative instantiations of this pattern must be built, and this 
leads to complexity problems comparable to some degree to those of Data Mining 
systems flattening a relational database into a single table. We have previously 
studied the benefits of parallelizing GAs, and we have shown that under some 
conditions it was interesting to distribute the database on several processors. 
Some approaches have been proposed to deal with this problem, for instance 
stochastic sampling. 

The genetic approach that we present is original in the sense that instead of 
representing in an individual whether two variables are equal or not, we encode 
the equivalence classes involved by these equality relations, and we define genetic 
operators that manipulate equivalence classes and their associated numeric constraints. 

Preliminary results are encouraging. A difficult part of testing our algorithm 
is the setting of the different parameters, mainly probabilities linked to the ap- 
plication of operators. We plan to make experiments on artificial examples, to 
study the influence of these parameters according to the expected form of the 
clause (number of linked variables, numeric variables, ...). 

Abstract. This paper is concerned with how to classify examples that 
are not covered by any rule in an unordered hypothesis. Instead of as- 
signing the majority class to the uncovered examples, which is the stan- 
dard method, a novel method is presented that minimally generalises the 
rules to include the uncovered examples. The new method, called Rule 
Stretching, has been evaluated on several domains (using the inductive 
logic programming system Virtual Predict for induction of the base hy- 
pothesis). The results show a significant improvement over the standard 
method. 



1 Introduction 

One major distinction between methods for induction of classification rules is 
whether they treat the hypothesis as an ordered or unordered set of rules. In 
the ordered case, there is no need for resolving classification conflicts among the 
rules, since the first applicable rule is used (such a hypothesis is commonly 
referred to as a decision list [15]). Furthermore, a decision list always includes 
a default rule at the end, which means that any example that may have passed 
through the previous rules without being covered will still be assigned a class. 
It should be noted that the standard inductive logic programming setting with 
two classes (positive and negative examples) and where a hypothesis is searched 
for that covers all positive examples but none of the negative, in fact is a special 
case of the ordered case, since it implicitly assumes that any example that is not 
covered by the rules should be classified as negative. In the case with unordered 
hypotheses, rules need to be generated for all classes and some strategy has to be 
adopted for resolving conflicts among the rules. Furthermore, it may 
very well happen that none of the rules is applicable when trying to classify new 
examples. A common strategy for handling such examples is to classify them as 
belonging to the majority class. 

In this paper we present a new method for classifying examples that are not 
covered by any of the rules in an (unordered) hypothesis. The method, Rule 
Stretching, is applied after the hypothesis has been induced, during the classification 
phase. Rule Stretching works by generalising the rules in a hypothesis 
to cover the previously uncovered examples. The method is not targeted at any 
special inductive logic programming system but is a general method for assigning 
classes to uncovered examples. 

The paper is organised as follows. In the next section, we present a general 
algorithm for Rule Stretching. This algorithm is specialised in Section 3 with 
respect to a specific learning paradigm and a system called Virtual Predict, 
which is used for induction of base hypotheses. The setup of the experiments 
and the results are presented next. The work presented in this paper has 
some ideas in common with Analogical Prediction, which are discussed in a 
later section. The paper ends with concluding remarks. The reader is 
assumed to be familiar with basic concepts of logic programming [2]. 

2 Rule Stretching 

Examples that are not covered by an unordered hypothesis are usually classified 
as belonging to a default class (usually the majority class). The work in this 
paper is instead based on the idea that unordered rules of an induced hypothesis 
can be ’stretched out’ to cover previously uncovered examples. Rule Stretching 
is used in the following way: 

1. Induce an unordered set of rules using an inductive logic programming sys- 
tem 

2. Classify new examples using the induced rules 

3. Examples that are not covered by any of the rules are given to the Rule 
Stretching system for classification 

A new, more general hypothesis that is guaranteed to cover a previously uncovered 
example can be formed by computing the minimal generalisation of the 
example and each rule in the hypothesis. The rules of the new hypothesis have to 
be evaluated since, by generalising the rules to cover the example, the accuracy 
of the rules may have changed (important conditions could have been removed 
by the generalisation, making it possible for a rule to cover more examples 
of other classes than it did before the generalisation). 

An example of how Rule Stretching works is illustrated in Figure 1, where
there are two classes a and b, two rules, R1 and R2, and an uncovered example
denoted '?'. In the picture to the left, the two rules and their coverage can
be seen as well as the uncovered example. In the picture to the right, the two
rules have been generalised so that they cover the unclassified example, and
the class of the example can be determined by, for example, selecting the most
probable class for the most accurate rule, which in this case means class a.
Note that the generalised rule R2 covers not only examples of class b but also
two examples of class a (thus decreasing its accuracy).

Fig. 1. Rule Stretching



A general algorithm for Rule Stretching is presented in Figure 2. The
algorithm takes a hypothesis $H$, background knowledge $B$, examples $E$, and
an uncovered example $e \in E$ such that $H \wedge B \not\models e$, and
returns a class label. Three functions are used by the algorithm: the
minimal_generalisation function, which returns the minimal generalisation of a
rule and an example, the coverage function, which takes a rule and a set of
examples and returns the number of examples of the different classes it
covers, and the classify function, which returns a class label given a set of
pairs of generalised rules and their coverage.



Input: hypothesis $H$, background knowledge $B$, examples $E$, an uncovered example $e$
Output: a class label $c$

1. $H' = \{\, r' \mid r \in H \wedge r' = \mathit{minimal\_generalisation}(r, e) \,\}$

2. $V = \{\, (r, v) \mid r \in H' \wedge v = \mathit{coverage}(r, E) \,\}$

3. $c = \mathit{classify}(V)$



Fig. 2. Rule Stretching Algorithm (general version)
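The three steps of Figure 2 can be sketched in Python as a higher-order function. This is only an illustrative sketch: the toy rule representation (numeric intervals), the helper names widen, count_classes and most_accurate, and the example data are assumptions made here and are not part of the paper. The background knowledge $B$ is omitted because the toy rules do not need it.

```python
from collections import Counter


def rule_stretching(hypothesis, examples, uncovered,
                    minimal_generalisation, coverage, classify):
    """General Rule Stretching: stretch every rule, evaluate, classify."""
    # Step 1: generalise each rule minimally so that it covers the example.
    stretched = [minimal_generalisation(rule, uncovered) for rule in hypothesis]
    # Step 2: pair each stretched rule with its class coverage on the examples.
    evaluated = [(rule, coverage(rule, examples)) for rule in stretched]
    # Step 3: derive a class label from the evaluated rules.
    return classify(evaluated)


# --- Hypothetical toy instantiation: a rule is an interval (lo, hi). ---

def widen(rule, x):
    """Smallest interval that contains both the rule's interval and x."""
    lo, hi = rule
    return (min(lo, x), max(hi, x))


def count_classes(rule, examples):
    """Number of covered examples per class."""
    lo, hi = rule
    return Counter(label for value, label in examples if lo <= value <= hi)


def most_accurate(evaluated):
    """Majority class of the stretched rule with the highest accuracy."""
    def accuracy(counts):
        return max(counts.values()) / sum(counts.values())
    _, best_counts = max(evaluated, key=lambda pair: accuracy(pair[1]))
    return best_counts.most_common(1)[0][0]


EXAMPLES = [(1, "a"), (2, "a"), (3, "a"), (6, "a"), (8, "b"), (9, "b")]
RULES = [(1, 3), (8, 9)]  # R1 covers class a, R2 covers class b

label = rule_stretching(RULES, EXAMPLES, 5, widen, count_classes, most_accurate)
print(label)
```

As in Figure 1, stretching the second rule makes it cover an example of the other class, so the first stretched rule is the more accurate one and determines the label.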



3 Rule Stretching Using Least General Generalisations 

The general version of the Rule Stretching algorithm should be specialised
with respect to the theoretical foundation of the inductive logic programming
system that is used. In this study we consider the framework of the system
Virtual Predict, which is described in Section 3.1. A special version of the
Rule Stretching algorithm that takes advantage of the properties of this
particular system is given in Section 3.2.



3.1 Virtual Predict 

Virtual Predict [2] is an inductive logic programming system that is a
successor of Spectre 3.0 [3]. The system can be viewed as an upgrade of
standard decision tree and rule induction systems in that it allows for more
expressive hypotheses to be generated and more expressive background knowledge
(i.e., logic programs) to be incorporated in the induction process. The major
design goal has been to achieve this upgrade in such a way that it is still
possible to emulate the standard techniques with lower expressiveness (but
also lower computational cost) within the system if desired. As a side effect,
this has allowed the incorporation of several recent methods, such as bagging,
boosting and randomisation, that have been developed for standard machine
learning techniques into the more powerful framework of Virtual Predict.

Like its predecessor, Virtual Predict uses resolution as a specialisation
operator. This means that each rule generated by the system is the result of
repeatedly applying resolution to some overly general clause. For reasons of
efficiency, the system internally represents rules in the same format as it
represents proofs of examples, namely as derivation terms.

A derivation term is a term of the form $c_i(t_1, \ldots, t_n)$, where $c_i$
is an identifier of some input clause in the derivation of the rule (or the
proof of the example), and $t_1, \ldots, t_n$ are derivation terms
corresponding to the sub-derivations (or sub-proofs) for the $n$ literals in
the body of the clause $c_i$.

For example, given the following overly general theory:

(c1)  target(Size, Shape, Weight) :-
          size(Size), shape(Shape), weight(Weight).
(c2)  size(A) :- A = small.
(c3)  size(A) :- A = medium.
(c4)  size(A) :- A = large.
(c5)  shape(A) :- regular(A).
(c6)  shape(A) :- irregular(A).
(c7)  regular(A) :- A = circular.
(c15) weight(A) :- A = low.

the proof of the example target(small, circular, low) would be represented
by the derivation term c1(c2, c5(c7), c15). The derived rule

      target(Size, Shape, Weight) :-
          Size = small, regular(Shape), weight(Weight).

is represented by the derivation term c1(c2, c5(_), _).

It should be noted that a derivation term for a derived rule typically is
non-ground, while a derivation term for a proof of an example always is
ground¹. By finding the proofs of all examples in advance of the induction
process and by representing the proofs together with derived rules as
derivation terms, the coverage check of a derived rule and an example is
reduced to unification, i.e. no theorem proving is needed. This has led to an
order of magnitude speedup in Virtual Predict compared to its predecessor.
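The coverage check on derivation terms can be illustrated with a small Python sketch. The representation is an assumption made for illustration only: a derivation term is a tuple (clause_id, sub-term, ...), and None stands for the anonymous variable `_`. Because each `_` is a distinct variable and proofs are ground, unification reduces here to one-way matching; rules with shared variables would need full unification, which this sketch does not implement.

```python
def covers(rule_term, proof_term):
    """Return True if the rule's derivation term matches the ground proof term."""
    if rule_term is None:  # anonymous variable: matches any sub-proof
        return True
    rule_id, *rule_args = rule_term
    proof_id, *proof_args = proof_term
    if rule_id != proof_id or len(rule_args) != len(proof_args):
        return False
    # All sub-derivations must match the corresponding sub-proofs.
    return all(covers(r, p) for r, p in zip(rule_args, proof_args))


# Proof of target(small, circular, low): c1(c2, c5(c7), c15)
PROOF_SMALL = ("c1", ("c2",), ("c5", ("c7",)), ("c15",))
# Proof of target(large, circular, low): c1(c4, c5(c7), c15)
PROOF_LARGE = ("c1", ("c4",), ("c5", ("c7",)), ("c15",))
# Derived rule c1(c2, c5(_), _)
RULE = ("c1", ("c2",), ("c5", None), None)

print(covers(RULE, PROOF_SMALL), covers(RULE, PROOF_LARGE))
```

The rule covers the first example but not the second, since the second proof uses clause c4 where the rule requires c2.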



¹ It should also be noted that some built-in predicates (such as arithmetic
predicates) need special treatment, but this falls outside the scope of the
paper.


3.2 A Specialised Rule Stretching Method

Since Virtual Predict represents the rules of a hypothesis as terms, it is
possible to compute the minimal generalisation of a rule and the proof of an
example by computing their least general generalisation.

Definition 1. An atom $c$ is a generalisation of atoms $a$ and $b$ if there
exist substitutions $\theta_1$ and $\theta_2$ such that $c\theta_1 = a$ and
$c\theta_2 = b$.

Definition 2. A generalisation $c$ of two atoms $a$ and $b$ is a least general
generalisation (lgg) if for each other generalisation $c_i$ of $a$ and $b$
there exists a substitution $\theta_i$ such that $c = c_i\theta_i$.
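The least general generalisation of two terms can be computed by anti-unification. The following Python sketch illustrates the idea under assumptions that are not taken from the paper: compound terms are tuples (functor, arg, ...), constants and variables are strings, and the fresh variable names V0, V1, ... are assumed not to occur in the input.

```python
def lgg(s, t, table=None):
    """Least general generalisation (anti-unification) of two terms.

    `table` maps each pair of differing subterms to the variable that
    replaces it, so that repeated pairs are replaced by the same variable.
    """
    if table is None:
        table = {}
    both_compound = isinstance(s, tuple) and isinstance(t, tuple)
    if both_compound and s[0] == t[0] and len(s) == len(t):
        # Same functor and arity: generalise argument by argument.
        args = tuple(lgg(a, b, table) for a, b in zip(s[1:], t[1:]))
        return (s[0],) + args
    if s == t:
        return s  # identical constants or variables are kept
    if (s, t) not in table:
        table[(s, t)] = f"V{len(table)}"  # fresh variable (assumed unused)
    return table[(s, t)]


# lgg of p(a, f(a)) and p(b, f(b)) is p(V0, f(V0)).
print(lgg(("p", "a", ("f", "a")), ("p", "b", ("f", "b"))))
# lgg of p(a, b) and p(a, c) is p(a, V0).
print(lgg(("p", "a", "b"), ("p", "a", "c")))
```

Reusing one variable for a repeated pair of subterms is what makes the result least general: p(V0, f(V1)) would also generalise the first pair of atoms, but it is strictly more general than p(V0, f(V0)).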

A new, more specific version of the general Rule Stretching algorithm was
formed by replacing the minimal_generalisation function with a function, lgg,
that computes the least general generalisation of a rule and the proof of an
example. Furthermore, the classify function was replaced with a function,
use_best_rule, that, given a set of evaluated rules, returns the class of the
rule that has the highest accuracy (with Laplace correction). The lgg version
of the Rule Stretching algorithm can be seen in Figure 3.



Input: hypothesis $H$, background knowledge $B$, examples $E$, an uncovered example $e$
Output: a class label $c$

1. $H' = \{\, r' \mid r \in H \wedge r' = \mathit{lgg}(r, e) \,\}$

2. $V = \{\, (r, v) \mid r \in H' \wedge v = \mathit{coverage}(r, E) \,\}$

3. $c = \mathit{use\_best\_rule}(V)$



Fig. 3. Rule Stretching Algorithm (lgg version)
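The selection step of Figure 3 can be sketched as follows. The paper does not spell out the Laplace correction; the sketch assumes the common form $(n_c + 1) / (n + k)$, where $n_c$ is the number of covered examples of the rule's majority class, $n$ the total number of covered examples and $k$ the number of classes. The function signature and the example counts are hypothetical.

```python
def laplace_accuracy(counts, n_classes):
    """Laplace-corrected accuracy of a rule from its per-class coverage."""
    total = sum(counts.values())
    majority = max(counts.values(), default=0)
    return (majority + 1) / (total + n_classes)


def use_best_rule(evaluated, n_classes):
    """Return the majority class of the rule with the highest corrected accuracy."""
    _, counts = max(evaluated,
                    key=lambda pair: laplace_accuracy(pair[1], n_classes))
    return max(counts, key=counts.get)


# Two stretched rules with their coverage (hypothetical numbers).
EVALUATED = [
    ("r1", {"a": 3}),            # pure, but covers only three examples
    ("r2", {"a": 1, "b": 20}),   # slightly impure, but well supported
]
print(use_best_rule(EVALUATED, n_classes=2))
```

With these counts the corrected accuracies are 4/5 = 0.8 for r1 and 21/23 for r2, so r2 is preferred although r1 has a raw accuracy of 1.0. This is the usual motivation for the correction: it penalises rules that are supported by few examples.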



4 Empirical Evaluation 

A number of experiments were conducted in order to find out whether the Rule
Stretching method performs better than choosing the majority class for
uncovered examples. In all of the experiments the base hypotheses were induced
by Virtual Predict. In Section 4.1 we describe how Virtual Predict was
configured and the domains used. The experimental results are given in
Section 4.2.

4.1 Experimental Setting 

There are a number of parameters that can be set when defining learning
methods in Virtual Predict, allowing a very wide range of methods to be
defined, including the emulation of standard techniques, such as decision tree
induction and naive Bayes classification. The parameters include the search
strategy to use (separate-and-conquer or divide-and-conquer), optimisation
criterion (e.g. information gain), probability estimate (e.g. m-estimate),
whether an ordered or unordered hypothesis should be induced, inference method
(how to apply the resulting hypothesis), post-pruning (using e.g. an MDL
criterion or a prune set) as well as ensemble learning methods (bagging,
boosting, and randomisation). There are also a number of parameters that have
to be set when defining experiments in Virtual Predict, such as what
experimental methodology to use (e.g. n-fold cross validation).

The parameters and their values were in this study set according to Table 1.
A covering (separate-and-conquer) approach to rule induction was used together
with incremental reduced error pruning [S], by which a generated rule is
immediately pruned back to some ancestor in the derivation sequence (the
pruning criterion was in this experiment set to accuracy on the entire
training set, but other options in Virtual Predict include accuracy on a
validation set and most compressive ancestor).

In case an example was covered by more than one rule, this conflict was
resolved by computing the most probable class using naive Bayes, by maximising
the following expression:

$$P(C \mid R_1 \wedge \ldots \wedge R_n) = P(C)\,P(R_1 \mid C) \cdots P(R_n \mid C) \qquad (1)$$

where $C$ is a class and $R_1, \ldots, R_n$ are the rules that cover a
particular example. It should be noted that in case a particular example is
not covered by any rule, maximising the above expression leads to assigning
the example the most probable class a priori, which is the standard method for
classifying uncovered examples.
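Maximising expression (1) can be sketched in a few lines of Python. The data structures and the probability values below are hypothetical and chosen only for illustration; in Virtual Predict the probabilities would be estimated from the training data (e.g. with the m-estimate).

```python
def most_probable_class(priors, likelihoods, covering_rules):
    """Return the class C maximising P(C) * prod_i P(R_i | C)."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for rule in covering_rules:
            score *= likelihoods[rule][cls]  # P(rule covers | class)
        scores[cls] = score
    return max(scores, key=scores.get)


# Hypothetical estimates for two classes and two rules.
PRIORS = {"a": 0.6, "b": 0.4}
LIKELIHOODS = {
    "R1": {"a": 0.5, "b": 0.1},
    "R2": {"a": 0.2, "b": 0.7},
}

print(most_probable_class(PRIORS, LIKELIHOODS, ["R1", "R2"]))  # both rules fire
print(most_probable_class(PRIORS, LIKELIHOODS, ["R2"]))        # only R2 fires
print(most_probable_class(PRIORS, LIKELIHOODS, []))            # uncovered
```

With an empty list of covering rules the product is empty and the score reduces to the prior, which reproduces the majority-class default mentioned above.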



Table 1. Virtual Predict settings

Parameter                          Value
Strategy                           Separate and Conquer
Optimisation criterion             Information Gain
Probability estimate               M Estimate, with M=2
Measure                            Information Gain
Incremental Reduced Error Pruning  Most Accurate on Training Set
Inference method                   Naive Bayes
Experiment type                    10-Fold Cross Validation



Rule Stretching was tested on the seven problems listed in Table 2, along with
some statistics about the number of classes, the distribution of the classes,
the number of examples that were not covered by the base hypothesis, and the
total number of examples in the domain. Four of the domains were collected
from the UCI Machine Learning Repository: the Balance Scale Database, the Car
Evaluation Database, the Congressional Voting Records Database, and the
Student Loan Relational Database. The data for the problem of recognising
illegal positions in a chess endgame with two kings and a rook (KRKI) and the
problem of predicting the secondary structure of proteins were
available from the web page of the Machine Learning Group at the University 
of York. The Alzheimers toxicity domain was available from Oxford University 
Computing Laboratory. The domain was described in . 

In the secondary protein structure domain the hypothesis was restricted 
to looking at properties for only three positions at a time (i.e., the predicate 
alpha_triplet/3 was used). 



Table 2. Domain statistics

Domain                       Classes  Class distribution (%)    Uncovered examples (%)  Total examples
Balance                      3        7.84; 46.08; 46.08        19.04                   625
Car                          4        3.76; 3.99; 22.22; 70.02  3.36                    1728
House Votes                  2        38.62; 61.38              2.53                    435
KRKI                         2        34.2; 65.8                2.6                     1000
Alzheimers tox.              2        50; 50                    6.54                    886
Secondary Protein Structure  2        43.4; 56.6                27.61                   1014
Student Loan                 2        35.7; 64.3                5.4                     1000





4.2 Experimental Results 

Two methods for classifying uncovered examples were compared in the seven 
domains: selecting the majority class and Rule Stretching. The same base hy- 
potheses were used in conjunction with the two methods, and these were pro- 
duced by Virtual Predict using the settings shown in the previous section. In 
all but one of the domains, 10-fold cross validation was employed. Due to long 
computation time for the secondary protein structure domain only a single run 
was made, using a single training and test set. 

The null hypothesis was that Rule Stretching is not more accurate than
selecting the majority class. The results of using the two methods are shown
in Table 3. One can see that in seven out of seven cases, Rule Stretching
results in more accurate classifications than assigning uncovered examples to
the majority class. The one-sided binomial tail probability of this number of
successes, given that the probability of success is 0.5, is approximately
0.0078, which allows for a rejection of the null hypothesis at the 1% level.
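The reported tail probability can be checked directly: with seven domains and a success probability of 0.5 under the null hypothesis, the probability of seven successes is $0.5^7 = 1/128 \approx 0.0078$. A short Python check follows; the function name is chosen here for illustration.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))


tail = binomial_upper_tail(7, 7)
print(tail)          # 0.0078125
print(tail < 0.01)   # True: significant at the 1% level
```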

Table 3. Results

Domain                       Majority Class  Rule Stretching
Balance                      76.32%          84.64%
Car                          93.98%          94.16%
House Votes                  94.25%          95.4%
KRKI                         97.1%           99.4%
Alzheimers Tox.              88.26%          90.18%
Secondary Protein Structure  57.99%          61.14%
Student Loan                 92.2%           92.7%



5 Related Work

The idea of including the example to be classified in the formation of the
hypothesis is shared with Analogical Prediction [11]. Analogical Prediction
uses background knowledge, training examples, and the example to be classified
to form a hypothesis, which is used to classify the example. For every new
example that is to be classified, a new hypothesis is formed. This leads to a
different behaviour when classifying examples than in the normal setting of
inductive logic programming.

The main difference between Rule Stretching and Analogical Prediction is
that Analogical Prediction forms a new hypothesis for every example that is to
be classified, whereas in Rule Stretching a previously induced hypothesis is
used to classify the examples that it covers, and rules of the hypothesis are
minimally generalised to cover the remaining examples. As a result, Rule
Stretching evaluates only as many candidate rules as there are rules in the
original hypothesis, while Analogical Prediction performs a costly search for
each example to be classified, typically evaluating a large number of
candidate rules.

Another related work, although not specifically aimed at the problem of 
classifying uncovered examples, was described in [T] where a two-layered system 
for handling imprecise definitions was presented. The rules of the first layer are 
supposed to capture the basic properties of the concepts while the second layer 
defines possible modifications of the basic properties which makes it possible to 
classify examples not covered by the rules of the first layer. 



6 Concluding Remarks 

A novel method for classifying examples that are not covered by an unordered
hypothesis has been presented. The method, Rule Stretching, is based on the
assumption that a more accurate classification can be made by generalising
rules of a base hypothesis to cover the uncovered examples than by using the
standard method of assigning the examples to the majority class. The
experiments, in which the inductive logic programming system Virtual Predict
was used for the induction of a base hypothesis, showed that Rule Stretching
performs significantly better than the standard method.

There are several directions for future research. One is to alter the least
general generalisation version of the Rule Stretching algorithm by replacing
the use_best_rule function with some other, more elaborate, function. For
example, all of the generalised rules could contribute to the decision of the
correct class label to return, by using naive Bayes to find the most probable
class given all of the generalised rules and their coverage. Another
alternative is to replace the classify function with a CN2 type of function.

Another interesting direction for future research would be to formulate a
version of Rule Stretching for some other system, such as Progol [10]. Since
rules and examples are not represented as terms in Progol, it is not possible
to compute the minimal generalisation by computing the least general
generalisation. Instead, rules in a hypothesis could be stretched out to
include uncovered examples by using relative least general generalisation
[14]. However, one major drawback of using relative least general
generalisation (compared to computing the least general generalisation of a
pair of atoms as done in this study) is that the computational cost is
significantly higher.

Yet another direction for future research would be to relax the condition in 
Rule Stretching that each rule is generalised minimally. One possibility is to 
allow the system to search in the lattice formed by the rule to be stretched and 
the most general rule. This would however be significantly more costly than the 
current approach. Another possibility would be to let the generalisation process 
continue after having classified an uncovered example by computing the mini- 
mal generalisations of the generalised hypothesis and a new uncovered example 
and stop this process only when the accuracy of the rules in the hypothesis 
significantly decreases. 
Abstract. In propositional learning, boosting has been a very popular
technique for increasing the accuracy of classification learners. In
first-order learning, on the other hand, surprisingly little attention has
been paid to boosting, perhaps due to the fact that simple forms of boosting
lead to loss of comprehensibility and are too slow when used with standard
ILP learners. In this paper, we show how both concerns can be addressed by
using a recently proposed technique of constrained confidence-rated boosting
and a fast weak ILP learner. We give a detailed description of our algorithm
and show on two standard benchmark problems that indeed such a weak learner
can be boosted to perform comparably to state-of-the-art ILP systems while
maintaining acceptable comprehensibility and obtaining short run-times.



1 Introduction 

In recent years, the field of Machine Learning has seen a very strong growth of 
interest in a class of methods that have collectively become known as ensemble 
methods. The general goal and approach of such methods is to increase predic- 
tive accuracy by basing the prediction not only on a single hypothesis but on 
a suitable combination of an entire set of hypotheses. Boosting is a particularly 
attractive class of ensemble methods since on the one hand it has originated in 
theoretical studies of learnability, but on the other hand has also been developed 
into practical algorithms that have demonstrated superior performance on quite 
a broad range of application problems. Boosting constructs multiple hypotheses 
by first calling a “weak” learner on the given examples to produce a first hypoth- 
esis. During each subsequent round of boosting, the weight of examples correctly 
handled by the hypothesis induced in the previous round is decreased, while the 
weight of examples incorrectly handled is increased. In the resulting set of hy- 
potheses, each hypothesis gets a voting weight corresponding to its prediction 
confidence, and the total prediction is obtained by summing up all these votes. 

Given the set of boosting approaches in propositional learning, it is somewhat
surprising that boosting has not received comparable attention within ILP,
with the notable exception of Quinlan's [H] initial experiments. There are two
possible reasons for this situation which appear especially relevant. Firstly,
understandability of results has always been a central concern of ILP
researchers beyond accuracy. Unfortunately, if, as in Quinlan's study, one
uses the classic form of confidence-rated boosting (AdaBoost.M1), the result
will be quite a large set of rules, each of which in addition has an attached
positive or negative voting weight. To understand the behaviour of one rule in
this rule set, it is necessary to consider all other rules and their relative
weights, making it quite difficult to grasp the results of the learner.
Secondly, in propositional learning, boosting is often applied simply by using
an unchanged existing propositional learner as a basis. If one carries this
over to ILP (e.g. Quinlan simply used FFOIL as a base learner), the run-times
of such a boosted ILP learner clearly would be problematic due to the high
effort already expended by a typical ILP system.

In this paper, we show that both of these concerns can be addressed by
suitably combining recent advances in boosting algorithms with a fast weak
learner. In particular, we show how constrained confidence-rated boosting
(CCRB), which is our denomination and interpretation of the approach described
in [2], can be used to significantly enhance the understandability of boosted
learning results by restricting the kinds of rule sets allowed. We combine
this with a greedy top-down weak learner based on the concept of foreign links
introduced in Midos [13], which uses a limited form of look-ahead and
optimizes the same heuristic criterion as used in [2]. In an empirical
evaluation on two known hard problems of ILP, the well-studied domains of
mutagenicity and Qualitative Structure Activity Relationships (QSARs), we show
that indeed such a simple weak learner together with CCRB achieves accuracies
comparable to much more powerful ILP systems, while maintaining acceptable
comprehensibility and obtaining short run-times.

The paper is organized as follows. In section 2, we review boosting, and mo- 
tivate the basic ideas of constrained confidence-rated boosting based on [2]. In 
section 3, we briefly describe our foreign link based weak learner and give a 
more detailed account of the heuristic evaluation functions employed to guide 
the search in the constrained hypothesis space. Section 4 details how the hy- 
potheses generated by the weak learner are used in the framework of CCRB. 
Our experimental evaluation of the approach is described and discussed in sec- 
tion 5. In section 6, we discuss related work in more detail. Section 7 contains 
our conclusions and some pointers to future work. 



2 Boosting 

Boosting is a method for improving the predictive accuracy of a learning sys- 
tem by means of combining a set of classifiers constructed by a weak learner 
into a single, strong hypothesis. It is known to work well with most
unstable classifier systems, i.e. systems where small changes to the training data
lead to notable changes in the learned classifier. The idea is to “boost” a weak 
learning algorithm performing only slightly better than random guessing into an 
arbitrarily accurate learner by repeatedly calling the weak learner on changing 
distributions over the training instances and combining the set of weak hypothe- 
ses into one strong hypothesis. In the resulting set of hypotheses, i.e. the strong 
hypothesis, each hypothesis gets a voting weight corresponding to its prediction 
confidence, and the total prediction is obtained by summing up all these votes. 
A probability distribution over the set of training instances is maintained.
The probabilities model the weights associated with each training instance and
indicate the influence of an instance when building a classifier. Initially,
all instances have equal influence on the construction of the weak hypotheses.
In each iterative call of the learner, a weak hypothesis is learned, which
computes a prediction confidence for each example. How this confidence is
determined is a design issue of the weak learning algorithm and will, for our
approach, be discussed in detail below.

On each round of boosting, the distribution over the training instances is
modified in accordance with the learned weak hypothesis, i.e. depending on its
assigned prediction confidence and the examples covered by it. The weights of
misclassified instances are increased and, analogously, those of correctly
classified instances are decreased according to the confidence of the learned
weak hypothesis. Thus, correctly classified instances will have less influence
on the construction of the weak hypothesis in the next iteration, and
misclassified instances will have a stronger influence. That way, the learner
is confronted in each new round of boosting with a modified learning task and
forced to focus on the examples in the training set which have not yet been
correctly classified. Finally, all weak hypotheses learned are combined into
one strong hypothesis. An instance $x$ is classified by the strong hypothesis
by adding up the prediction confidence of each weak hypothesis covering $x$
and predicting the class $y$ of $x$ as positive if the sum of confidences of
all hypotheses covering $x$ is positive, otherwise predicting $y$ as negative.
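The reweighting step can be sketched as follows. The text above only states the qualitative behaviour; the sketch assumes the exponential update commonly used in confidence-rated boosting, $D'(i) \propto D(i) \exp(-y_i\, h(x_i))$ with $y_i \in \{-1, +1\}$, where $h(x_i)$ is the confidence $c$ on covered instances and 0 on uncovered ones. The function name and the example values are hypothetical.

```python
import math


def update_distribution(weights, labels, covered, confidence):
    """One reweighting round for a hypothesis that votes `confidence`
    on covered instances and abstains (votes 0) on all others."""
    raw = [
        w * math.exp(-y * confidence) if is_covered else w
        for w, y, is_covered in zip(weights, labels, covered)
    ]
    total = sum(raw)                 # normalisation constant
    return [w / total for w in raw]  # weights sum to 1 again


WEIGHTS = [0.25, 0.25, 0.25, 0.25]
LABELS = [+1, +1, -1, -1]
COVERED = [True, False, True, False]  # the hypothesis covers instances 0 and 2

new_weights = update_distribution(WEIGHTS, LABELS, COVERED, confidence=0.5)
print(new_weights)
```

With a positive confidence, the covered positive instance (correctly classified) loses weight, the covered negative instance (misclassified) gains weight, and the two uncovered instances keep equal weights.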

The classic form of (unconstrained) confidence-rated boosting (AdaBoost.M1)
yields quite a large set of rules, each of which in addition has an attached
positive or negative voting weight. Moreover, each weak hypothesis may vote
with different confidences for different examples. This way, rules inferring
the target predicate are learned as well as rules for the negation of the
target predicate.

In our ILP setting, we will, in contrast, firstly assume that the weak learner
produces on each iteration a hypothesis in the form of a single Horn clause
$H \leftarrow L_1, L_2, \ldots, L_n\ [c]$ with an associated real number $c$,
where $H$ is the atom $p(X_1, \ldots, X_{a(p)})$ and $p$ the target predicate
of arity $a(p)$, the $L_i$ are atoms with background predicates $p_i$, and $c$
represents the prediction confidence of the hypothesis. This prediction
confidence is used as the voting weight of the hypothesis on all examples
covered by it, where large absolute values indicate high confidence. Moreover,
we will restrict the weak hypothesis to vote "0", i.e. to abstain, on all
examples not covered by it.

Thereby, the semantics of a rule is, as opposed to usual ILP practice,
determined by the sign of its attached prediction confidence. A hypothesis
$H \leftarrow L_1, L_2, \ldots, L_n\ [c]$ such that $c > 0$ implies that $H$
is true. It is interpreted as classifying all instances covered by it as
positive with prediction confidence $c$. Conversely, a hypothesis with $c < 0$
implies that $H$ is false and is interpreted as classifying each instance
covered by it as negative.

Here is an example of a boosting result consisting of 7 weak hypotheses when
learning a target predicate p.

1. p(X) ← q(X,a). [0.2]           4. p(X) ← q(X,Y), v(Y). [-0.6]
2. p(X) ← q(X,Y), r(Y). [0.9]     5. p(X) ← r(X). [-0.5]
3. p(X) ← s(X). [0.1]             6. p(X) ← q(X,b). [-0.3]
                                  7. p(X) ← t(X). [-0.9]

In order to classify a new instance about which we know q(1,a), v(a), t(1),
s(1), we need to check which hypotheses cover this example. Here, we find that
hypotheses 1, 3, 4 and 7 cover the example, so we compute the sum of their
confidences, yielding 0.2 + 0.1 - 0.6 - 0.9 = -1.2 < 0, and the instance is
classified as negative. In other words, to understand the behaviour of one
rule in this rule set, it is necessary to consider all other rules and their
relative weights, making it quite difficult to grasp the results of the
learner.
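The classification of this instance can be reproduced with a short Python sketch. The encoding is an assumption made for illustration: facts are tuples in a set, and each clause body is hand-translated into a Python predicate rather than evaluated by a general first-order coverage test.

```python
# Facts known about the new instance 1 (taken from the example above).
FACTS = {("q", 1, "a"), ("v", "a"), ("t", 1), ("s", 1)}


def holds(*atom):
    """True if the ground atom is among the known facts."""
    return atom in FACTS


def q_partners(x):
    """All Y such that q(x, Y) is known."""
    return [f[2] for f in FACTS if len(f) == 3 and f[0] == "q" and f[1] == x]


# (body test, confidence) for the seven weak hypotheses, in order.
HYPOTHESES = [
    (lambda x: holds("q", x, "a"), 0.2),
    (lambda x: any(holds("r", y) for y in q_partners(x)), 0.9),
    (lambda x: holds("s", x), 0.1),
    (lambda x: any(holds("v", y) for y in q_partners(x)), -0.6),
    (lambda x: holds("r", x), -0.5),
    (lambda x: holds("q", x, "b"), -0.3),
    (lambda x: holds("t", x), -0.9),
]


def vote(x):
    """Sum of the confidences of all hypotheses covering x."""
    return sum(conf for body, conf in HYPOTHESES if body(x))


def classify(x):
    return "positive" if vote(x) > 0 else "negative"


covering = [i + 1 for i, (body, _) in enumerate(HYPOTHESES) if body(1)]
print(covering, round(vote(1), 6), classify(1))
```

The covering hypotheses are 1, 3, 4 and 7, the summed vote is -1.2 (up to floating-point rounding), and the instance is classified as negative, in agreement with the text.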

In our approach of constrained confidence-rated boosting, which is our
interpretation of the ideas in [2], we will restrict each hypothesis to either
of two forms. A hypothesis is either positively correlated, i.e. predicting
the positive class, and equipped with a positive prediction confidence, or it
is the default hypothesis $p(X_1, \ldots, X_{a(p)})$ with an assigned negative
confidence. Constraining the hypotheses to either of these two forms ensures
that the resulting set of hypotheses can be more easily interpreted. Namely,
in order to appraise the quality of a hypothesis, it suffices to consider its
assigned prediction confidence in proportion to just the weight of the default
hypothesis, instead of having to consider all other hypotheses and their
assigned weights.

Using the additional restrictions, we see for the above example that with 
CCRB only results of the following form would be allowed, making learning 
harder but guaranteeing better understandability: 

1. p(X) ← q(X,a). [0.2]           4. p(X). [-0.3]
2. p(X) ← q(X,Y), r(Y). [0.9]
3. p(X) ← s(X). [0.1]

Since the same weak hypothesis might be generated more than once by the weak
learner, we can further simplify the set of resulting hypotheses by
summarizing hypotheses $H_1\,[c_1], \ldots, H_n\,[c_n]$ which only differ with
regard to their assigned confidences. A set of such identical hypotheses can
be replaced by a single hypothesis $H'\,[c]$, $H' = H_i$, $1 \le i \le n$,
with $c = \sum_{1 \le i \le n} c_i$.
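Merging repeated hypotheses amounts to summing the confidences of identical clauses. A minimal Python sketch follows; it assumes that identical hypotheses are syntactically identical strings (no renaming of variables is attempted), and the example values are hypothetical.

```python
from collections import defaultdict


def merge_identical(hypotheses):
    """Replace repeated clauses by one clause with the summed confidence."""
    merged = defaultdict(float)
    for clause, confidence in hypotheses:
        merged[clause] += confidence
    return dict(merged)


BOOSTING_RESULT = [
    ("p(X) <- q(X,a).", 0.2),
    ("p(X) <- s(X).", 0.1),
    ("p(X) <- q(X,a).", 0.3),  # generated a second time
    ("p(X).", -0.1),           # default hypothesis
    ("p(X).", -0.2),           # default hypothesis, generated again
]

summary = merge_identical(BOOSTING_RESULT)
for clause, confidence in summary.items():
    print(clause, round(confidence, 6))
```

Since the final prediction only depends on the sum of the votes, the merged set classifies every instance exactly as the original set does.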

The constraint on the weak hypotheses requires the weak learner to employ a
search strategy guaranteeing that only positively correlated hypotheses with a
positive prediction confidence are learned, or that the default hypothesis is
opted for if no such positively correlated hypothesis can be induced from the
training instances. [2] offers a theoretically well-founded heuristic for this
problem, which will be discussed in more detail in the following sections.

3 The Weak Relational Learner 

Our greedy top-down weak learner uses a refinement operator based on the
concept of foreign links introduced in Midos [13]. This refinement operator is
elucidated in detail in the following section. We will then discuss in section
3.2 the heuristics guiding the search of the greedy weak learner based on this
refinement operator in the hypothesis space. In Table 1 we give a more concise
description of the weak greedy learner embedded into the framework of
constrained confidence-rated boosting. In the following, references to steps
in Table 1 will be indicated by “T1._”.



3.1 The Refinement Operator 

The refinement operator $\rho$ is based on the concept of foreign links
introduced in Midos [13]. The hypothesis space consists of non-recursive,
function-free Horn clauses $C = H \leftarrow B$, where $H$ is the atom
$p(X_1, \ldots, X_{a(p)})$ and $p$ the target predicate of arity $a(p)$. In
order to constrain the complexity of the hypothesis space, our weak learner
employs a foreign literal restriction [14] as declarative bias, which is a
constrained form of linkedness of clauses. When specializing a clause $C$ by
adding a new literal $L$, $L$ must share at least one variable with previous
literals in $C$. The foreign literal restriction further confines the set of
alternative literals by means of an explicit definition of those literals and
variable positions that are to be considered for refinement. Hypotheses are
only refined along link paths designated by these definitions, the so-called
foreign links. For a clause $C = L_1, \ldots, L_n$, a foreign link between a
variable $V$ first occurring at position $p_i$ in literal $L_i$ with predicate
name $r$, and a different variable $U$ first occurring at position $p_j$ in
$L_j$ with predicate name $s$ is defined as $r[p_i] \rightarrow s[p_j]$.

Furthermore, we employ a limited form of look-ahead in our refinement operator
in order to avoid the shortsightedness problem with respect to existential
variables in the hypotheses generated by the greedy weak relational learner.
Merely introducing new existential variables in a clause will probably not
lead to notable changes, and the greedy learner is apt to rather select a
literal that restricts existing variables. Thus, when specializing a clause
$C$ into $C' = C, L$ by means of adding a new literal $L$ to $C$, we
concurrently add to the set $\rho(C)$ of refinements of $C$ all
specializations of $C'$ obtained by successively instantiating the new
variables in $L$.

Given, for example, a target predicate active/1, a predicate atm/3, and a
foreign link declaration $active[1] \rightarrow atm[1]$, applying $\rho$ to $C = active(X_1)$ would
result in the specializations

\[
\begin{aligned}
&active(X_1) \leftarrow atm(X_1, X_2, X_3),\\
&active(X_1) \leftarrow atm(X_1, c, X_3),\\
&active(X_1) \leftarrow atm(X_1, cl, X_3),\\
&active(X_1) \leftarrow atm(X_1, X_2, X_3),\ X_3 < -0.782,\\
&active(X_1) \leftarrow atm(X_1, X_2, X_3),\ X_3 > -0.782,\\
&active(X_1) \leftarrow atm(X_1, X_2, X_3),\ X_3 < 1.002,\\
&active(X_1) \leftarrow atm(X_1, X_2, X_3),\ X_3 > 1.002,
\end{aligned}
\]

if $X_2$ is a nominal variable with the domain $\{c, cl\}$, and $X_3$ is a continuous
variable with discretization $\mathcal{D} = [-0.782, 1.002]$.
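
The look-ahead step in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `lookahead_refinements`, the string encoding of clauses, and the convention that a list of strings denotes a nominal domain while a list of floats denotes discretization thresholds are all hypothetical and not taken from the system described here.

```python
from typing import Dict, List, Sequence, Union

# Hypothetical encoding: a nominal domain is a list of constant names,
# a continuous variable is represented by its list of discretization thresholds.
Domain = Union[Sequence[str], Sequence[float]]


def lookahead_refinements(head: str, pred: str, linked_var: str,
                          new_vars: Dict[str, Domain]) -> List[str]:
    """Return the clause extended by the new literal plus its look-ahead variants."""
    args = [linked_var] + list(new_vars)
    base = f"{head} <- {pred}({', '.join(args)})"
    refinements = [base]
    for var, domain in new_vars.items():
        for value in domain:
            if isinstance(value, str):
                # Nominal variable: replace the variable by one constant of its domain.
                inst = [value if a == var else a for a in args]
                refinements.append(f"{head} <- {pred}({', '.join(inst)})")
            else:
                # Continuous variable: one test on either side of each threshold.
                refinements.append(f"{base}, {var} < {value}")
                refinements.append(f"{base}, {var} > {value}")
    return refinements


clauses = lookahead_refinements(
    "active(X1)", "atm", "X1",
    {"X2": ["c", "cl"], "X3": [-0.782, 1.002]},
)
for clause in clauses:
    print(clause)
```

With the domains of the example, the sketch yields the unconstrained extension, two nominal instantiations, and four threshold tests, i.e. seven refinements in total.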

More generally, let $L = r(V_1, \ldots, V_{a(r)})$ be a literal with predicate name $r$ of
arity $a(r)$, and let $Vars(L)$ denote the variables in $L$ not occurring in the clause
$C$ to be specialized. Then, adding $C' = C, L$ to $\rho(C)$ results in additionally
adding to $\rho(C)$ the following refinements:

1. $C, L\theta_i^j$, $1 \le i \le a(r)$, $1 \le j \le |Val(V_i)|$, such that $V_i \in Vars(L)$ and $V_i$ is a
variable with nominal values, where $Val(V_i)$ denotes the domain of variable
$V_i$ and $L\theta_i^j = r(V_1, \ldots, V_i/c_j, V_{i+1}, \ldots, V_{a(r)})$, $c_j \in Val(V_i)$.

2. $C, L, \rho(V_i)$, $1 \le i \le a(r)$, such that $V_i \in Vars(L)$ and $V_i$ is a variable with
continuous values. $\rho(V_i)$ is defined for variables $V_i$ with continuous values as
follows. If $\mathcal{D} = \{d_1, \ldots, d_n\}$ is the discretization of the values of $V_i$, then for
any $d_k$ in $\mathcal{D}$, $V_i < d_k$ and $V_i > d_k$ are in $\rho(V_i)$.

Let $C = L_1, \ldots, L_n$ be a clause to be specialized, and let
$U_1, \ldots, U_{k-1}, U_{k+1}, \ldots, U_{a(s)}$ be new variables not occurring in $C$. Let $a(r)$ denote the arity of a
literal with predicate name $r$, and let $Val(V)$ denote the domain of a variable
$V$. Furthermore, let $F$ be the set of all foreign links defined for the literals
at hand. Then the refinement operator $\rho$ can be defined as follows: for any
$L_i = r(V_1, \ldots, V_{a(r)})$ in $C$ such that $r[m] \rightarrow s[k] \in F$,

1. $L_1, \ldots, L_n, s(U_1, \ldots, U_{k-1}, V_m, U_{k+1}, \ldots, U_{a(s)}) \in \rho(C)$

2. $L_1, \ldots, L_n, s(U_1, \ldots, U_{k-1}, V_m, U_{k+1}, \ldots, U_{a(s)})\theta_l^j \in \rho(C)$
for $1 \le l \le a(s)$, $1 \le j \le |Val(U_l)|$, $U_l$ a variable with nominal values

3. $L_1, \ldots, L_n, s(U_1, \ldots, U_{k-1}, V_m, U_{k+1}, \ldots, U_{a(s)}), \rho(U_l) \in \rho(C)$
for $1 \le l \le a(s)$, $U_l$ a variable with continuous values.

3.2 Search Strategy 

Our weak first-order inductive learner accepts as input instances from a set
$E = E^+ \cup E^-$ of training examples along with a probability distribution $D$
over the training instances. The background knowledge is provided in form of a
set $B$ of ground facts over background predicates. However, we will sometimes
write $E^+$ and $E^-$ somewhat differently than used in ILP, and will say that
$E = \{(x, 1) \mid x \in E^+\} \cup \{(x, -1) \mid x \in E^-\}$.

To avoid overfitting in the weak learner, the training instances are randomly
split into two sets, $\mathcal{G}$ and $\mathcal{P}$, used to specialize clauses and to prune these
refinements later on, respectively. Starting with the target predicate, the weak learner
greedily generates specializations which are positively correlated with the
training instances and thus have a positive prediction confidence on the training set.

When thinking about strategies to guide the search of a greedy learner, en- 
tropy based methods like information gain represent an obvious choice. However, 
the theoretical framework of boosting provides us with a guiding strategy based 
on one of the specific features of boosting, namely the probability distribution 
being modified in each iterative call of the weak learner. 

As suggested by [2], the training error can be minimized by searching in each
round of boosting for a weak hypothesis maximizing the objective function

\[ \tilde{z}(C) =_{def} \left( \sqrt{w_+(C, \mathcal{G})} - \sqrt{w_-(C, \mathcal{G})} \right)^2 \tag{1} \]

which is based on the collective weight of all positive and negative instances in
$\mathcal{G}$ covered by clause $C$. For a clause $C$ and a set $S$, the two weight functions
$w_+$, $w_-$ are defined by

\[
w_+(C, S) =_{def} \sum_{(x_i, y_i) \in S \text{ covered by } C,\; y_i = 1} D_t^i,
\qquad
w_-(C, S) =_{def} \sum_{(x_i, y_i) \in S \text{ covered by } C,\; y_i = -1} D_t^i
\tag{2}
\]

where $D_t^i$ is the weight of the $i$-th instance under the current probability distribution.

Since clauses $C$ maximizing $\tilde{z}(C)$ may be negatively correlated with the
positive class, we restrict, as proposed in [2], the search to positively correlated
clauses, i.e. to clauses maximizing the objective function $z$ defined as

\[ z(C) =_{def} \sqrt{w_+(C, \mathcal{G})} - \sqrt{w_-(C, \mathcal{G})}. \tag{3} \]
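
The difference between the two objective functions can be made concrete with a small sketch. It assumes that the weight functions sum the distribution weights of the covered positive and negative instances, and that the first objective is the square of the second; the names `weights`, `z_tilde` and `z` and the toy data are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Example = Tuple[int, int]  # (instance id, label) with label in {+1, -1}


def weights(covers: Callable[[int], bool], examples: Sequence[Example],
            dist: Sequence[float]) -> Tuple[float, float]:
    """Summed weight of the covered positive and of the covered negative examples."""
    w_pos = sum(d for (x, y), d in zip(examples, dist) if y == 1 and covers(x))
    w_neg = sum(d for (x, y), d in zip(examples, dist) if y == -1 and covers(x))
    return w_pos, w_neg


def z_tilde(w_pos: float, w_neg: float) -> float:
    """Squared objective: blind to the direction of the correlation."""
    return (math.sqrt(w_pos) - math.sqrt(w_neg)) ** 2


def z(w_pos: float, w_neg: float) -> float:
    """Signed objective: positive only for positively correlated clauses."""
    return math.sqrt(w_pos) - math.sqrt(w_neg)


examples = [(1, 1), (2, 1), (3, -1), (4, -1)]
dist = [0.25, 0.25, 0.25, 0.25]

mostly_pos = weights(lambda x: x <= 3, examples, dist)  # covers 2 positives, 1 negative
mostly_neg = weights(lambda x: x >= 2, examples, dist)  # covers 1 positive, 2 negatives

print(z_tilde(*mostly_pos), z_tilde(*mostly_neg))  # the same value for both clauses
print(z(*mostly_pos), z(*mostly_neg))              # opposite signs
```

Both toy clauses receive the same squared score, while the signed score separates them, which is the reason for restricting the search to the signed objective.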



The refinement operator $\rho$ of the weak relational learner iteratively refines, as
described in detail in section 3.1, the clause $C$ currently maximizing the objective
function $z$ until either a clause $C'$ is found with hitherto maximal $z(C')$ that
covers only positive examples, or until the objective function $z$ cannot be further
maximized (Table 1, step 2(d)).

The positively correlated clause $C$ resulting from this greedy refinement process
is subject to overfitting on the training instances, and is thus immediately
examined to see whether it can be pruned. Namely, all generalizations of $C$
resulting from deleting single literals and constants in $C$ from right to left are
generated (Table 1, step 2(e)).

The objective function is only maximized on the set $\mathcal{G}$, based on which
rules are generated by the weak learner. However, the evaluation of the prediction
confidence of a weak hypothesis is based on the entire set of training examples.
Thus, it is possible for the weak learner to learn a hypothesis $C$ with confidence $c < 0$
which is, on the entire training set, negatively correlated with the positive class.
Such hypotheses are not considered, in order to ensure the constraint that a weak
hypothesis be either positively correlated or the default hypothesis.
Thus, generalizations of $C$ which have a non-positive prediction confidence on
the whole training set are ruled out (Table 1, step 2(f)). If no generalization of $C$ with a
positive prediction confidence exists, the default hypothesis is chosen as current
weak hypothesis (Table 1, step 2(g)). The prediction confidence of a clause $C$ on a set $S$ is
defined as

\[ c(C, S) =_{def} \frac{1}{2} \ln \left( \frac{w_+(C, S) + \frac{1}{2N}}{w_-(C, S) + \frac{1}{2N}} \right) \tag{4} \]

where $N$ is the number of training instances and $\frac{1}{2N}$ is a smoothing constant
applied to avoid extreme estimates when $w_-(C, S)$ is small.
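
As a rough illustration of the smoothing, the following sketch computes a prediction confidence under the assumption that it has the smoothed log-ratio form used by Cohen and Singer [2], i.e. half the logarithm of the ratio of covered positive to covered negative weight, each increased by 1/(2N). The function name and arguments are hypothetical.

```python
import math


def confidence(w_pos: float, w_neg: float, n: int) -> float:
    """Smoothed log-ratio confidence; assumes the smoothing constant 1 / (2 * n)."""
    smooth = 1.0 / (2 * n)
    return 0.5 * math.log((w_pos + smooth) / (w_neg + smooth))


# A clause covering no negative weight still gets a finite confidence.
print(confidence(0.5, 0.0, 100))
# Equal positive and negative weight yields confidence zero.
print(confidence(0.2, 0.2, 100))
```

Without the smoothing term, the first call would divide by zero; with it, the confidence stays bounded and grows with the number of training instances.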

All generalizations of $C$ with a positive prediction confidence on the entire
training set are then evaluated with respect to their confidence on the set $\mathcal{G}$ and
their coverage and accuracy on the set $\mathcal{P}$. This kind of evaluation is proposed by
[2], who define a loss function for a clause $C$ with associated
confidence $c(C, \mathcal{G})$ as

\[
loss(C) =_{def} \bigl(1 - (w_+(C, \mathcal{P}) + w_-(C, \mathcal{P}))\bigr)
+ w_+(C, \mathcal{P}) \cdot e^{-c(C, \mathcal{G})} + w_-(C, \mathcal{P}) \cdot e^{c(C, \mathcal{G})}.
\tag{5}
\]

This loss function is minimized over all generalizations of $C$ with a positive
prediction confidence (Table 1, step 2(h)i).
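
A minimal sketch of such a pruning loss is given below. It assumes the exponential form of Cohen and Singer [2]: the weight left uncovered on the pruning set, plus the covered positive weight discounted by the confidence, plus the covered negative weight inflated by it. The argument names are hypothetical.

```python
import math


def clause_loss(w_pos_prune: float, w_neg_prune: float, conf_grow: float) -> float:
    """Loss of a clause on the pruning set, given its confidence on the growing set."""
    uncovered = 1.0 - (w_pos_prune + w_neg_prune)
    return (uncovered
            + w_pos_prune * math.exp(-conf_grow)
            + w_neg_prune * math.exp(conf_grow))


print(clause_loss(0.4, 0.0, 1.0))  # covers only positive weight: loss below 1
print(clause_loss(0.0, 0.4, 1.0))  # covers only negative weight: loss above 1
```

A clause with confidence zero always has loss 1, so values below 1 indicate that keeping the clause is expected to reduce the training error.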



In a last step, the positively correlated generalization $C'$ of $C$ with minimal
$loss(C')$ is compared to the default hypothesis with respect to the expected
training error (Table 1, step 2(h)ii). Since a positively correlated clause is compared to
the default hypothesis predicting the negative class, the objective function to be
maximized is in this case $z$ as defined in equation (3). Whichever of these two
hypotheses maximizes $z$ is chosen as the weak hypothesis of the current iteration
of the greedy learner.



4 Constrained Confidence-Rated Boosting of a Weak 
Relational Learner 

In this section, following [2], we explain how the weak hypotheses generated in
each iteration of the weak greedy learner are used in the framework of CCRB
[2]. The weak learner is invoked $T$ times. Let $C_t$ denote the weak hypothesis
generated in the $t$-th iteration based on the refinement operator and the heuristic
search strategy described in the previous section.

$C_t$ is used in the function $h_t : X \rightarrow \mathbb{R}$,

\[
h_t(x) =
\begin{cases}
c(C_t, E) & \text{if } e = (x, y) \text{ is covered by } C_t \\
0 & \text{else,}
\end{cases}
\]

mapping each instance $x$ to a real-valued number, i.e. to the prediction confidence
of $C_t$ on the entire training set if $x$ is covered by $C_t$, and to 0 otherwise (Table 1, step 2(i)).

Before starting the next round of boosting, the probability distribution over
the training instances, which is initially uniform, is updated by means of $h_t$,
namely by determining

\[ D_t^{i\prime} = \frac{D_t^i}{e^{(y_i \cdot h_t(x_i))}}. \tag{6} \]

This way, the weights of all instances $x$ not covered by the weak hypothesis
$C_t$, i.e. such that $h_t(x) = 0$, are not modified, whereas the weights of all positive
and negative instances covered by $C_t$ are decreased and increased, respectively,
in proportion to the prediction confidence of $C_t$ (Table 1, step 2(j)) by means of $h_t$.
Then, the resulting weights are normalized,

\[ D_{t+1}^i = \frac{D_t^{i\prime}}{\sum_j D_t^{j\prime}}, \quad 1 \le i \le N, \tag{7} \]

so as to serve as the probability distribution of the next iteration.
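
The update and normalization can be sketched as follows, assuming that each weight is divided by the exponential of the label times the weak prediction and that the results are then rescaled to sum to one. The function name and the toy values are hypothetical.

```python
import math
from typing import List, Sequence


def update_distribution(dist: Sequence[float], labels: Sequence[int],
                        predictions: Sequence[float]) -> List[float]:
    """Reweight instances by exp(-y * h(x)) and renormalize to a distribution."""
    unnormalized = [d / math.exp(y * h) for d, y, h in zip(dist, labels, predictions)]
    total = sum(unnormalized)
    return [d / total for d in unnormalized]


dist = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
# The weak hypothesis covers instances 0 and 2 with confidence 0.5 and predicts 0 elsewhere.
predictions = [0.5, 0.0, 0.5, 0.0]

new_dist = update_distribution(dist, labels, predictions)
print(new_dist)
```

In the toy run, the covered positive instance ends up with the smallest weight and the covered negative instance with the largest, so the next round concentrates on the mistake made by the current weak hypothesis.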
Table 1. Constrained Confidence-Rated Boosting Algorithm

Let $N$ denote the number of training instances $e = (x_i, y_i) \in E = E^+ \cup E^-$, $p$ the
target predicate of arity $a(p)$, and let $T$ denote the total number of iterations of the
weak learner. Furthermore, let $w_+$, $w_-$ denote the weight functions defined according
to equation (2), $c(C, S)$ the prediction confidence of a clause $C$ on a set $S$ defined
according to equation (4), and $z$ the objective function defined according to equation
(3).

1. Set $D_1^i := 1/N$ for $1 \le i \le N$

2. For $t = 1 \ldots T$

   (a) Split training set $E$ randomly into $\mathcal{G}$ and $\mathcal{P}$ according to $D_t$ such that ...

   (b) $C := p(X_1, \ldots, X_{a(p)})$

   (c) $Z := 0$

   (d) While $w_-(C, \mathcal{G}) > 0$

       i. Let $C' := \arg\max_{C'' \in \rho(C)} \{z(C'')\}$

       ii. Let $Z' := z(C')$

       iii. If $Z' - Z < 0$ exit loop

       iv. Else $C := C'$, $Z := Z'$

   (e) $Prunes(C) := \{p(X_1, \ldots, X_{a(p)}) \leftarrow B \mid C = p(X_1, \ldots, X_{a(p)}) \leftarrow BB'\}$

   (f) Remove from $Prunes(C)$ all clauses $C'$ where $c(C', E) \le 0$

   (g) If $Prunes(C) = \emptyset$ let $C_t := p(X_1, \ldots, X_{a(p)})$

   (h) Else

       i. $C' := \arg\min_{C'' \in Prunes(C)} \{loss(C'')\}$, where $loss(C'')$ is defined
          according to equation (5)

       ii. Let $C_t := \arg\max_{C'' \in \{C',\, p(X_1, \ldots, X_{a(p)})\}} \{\sqrt{w_+(C'', \mathcal{G})} - \sqrt{w_-(C'', \mathcal{G})}\}$

   (i) $h_t : X \rightarrow \mathbb{R}$ is the function

       \[
       h_t(x) =
       \begin{cases}
       c(C_t, E) & \text{if } e = (x, y) \text{ is covered by } C_t \\
       0 & \text{else}
       \end{cases}
       \]

   (j) Update the probability distribution $D_t$ according to

       \[
       D_t^{i\prime} =
       \begin{cases}
       D_t^i & \text{if } e = (x_i, y_i) \text{ not covered by } C_t \\
       D_t^i \cdot e^{h_t(x_i)} & \text{if } e \text{ covered by } C_t \text{ and } e \in E^- \\
       D_t^i / e^{h_t(x_i)} & \text{if } e \text{ covered by } C_t \text{ and } e \in E^+
       \end{cases}
       \qquad
       D_{t+1}^i = \frac{D_t^{i\prime}}{\sum_j D_t^{j\prime}}, \quad 1 \le i \le N
       \]

3. Construct the strong hypothesis

   \[ H(x) := sign \Bigl( \sum_{C_t : (x, y) \text{ covered by } C_t} c(C_t, E) \Bigr) \]
After the last iteration of the weak learner, the strong hypothesis is defined
by means of all weak hypotheses induced by the training instances. For each
instance $x$ the prediction confidences of all hypotheses covering $x$ are summed
up. If this sum is positive, the strong hypothesis classifies $x$ as positive, otherwise
$x$ is classified as negative:

\[ H(x) := sign \Bigl( \sum_{C_t : (x, y) \text{ covered by } C_t} c(C_t, E) \Bigr) \tag{8} \]
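
The voting scheme can be sketched as below. The representation of a weak hypothesis as a pair of a coverage test and a confidence, the dictionary encoding of a molecule, and the three toy rules are assumptions made for illustration only; the numeric confidences are merely sample values.

```python
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, float]
# Hypothetical representation: (coverage test, prediction confidence).
WeakHypothesis = Tuple[Callable[[Instance], bool], float]


def strong_hypothesis(x: Instance, weak_hypotheses: List[WeakHypothesis]) -> int:
    """Sum the confidences of all weak hypotheses covering x and take the sign."""
    total = sum(conf for covers, conf in weak_hypotheses if covers(x))
    return 1 if total > 0 else -1


rules: List[WeakHypothesis] = [
    (lambda m: True, -1.4),              # default rule: covers everything, votes negative
    (lambda m: m["logp"] > 5.0, 0.47),   # weak positive evidence
    (lambda m: m["rings"] >= 3, 3.55),   # strong positive evidence
]

print(strong_hypothesis({"logp": 5.5, "rings": 1}, rules))  # -1
print(strong_hypothesis({"logp": 5.5, "rings": 3}, rules))  # 1
```

Because the default rule covers every instance with a negative confidence, an instance is classified as positive only if the positive rules covering it outweigh that default vote.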



5 Empirical Evaluation 

We conducted an empirical evaluation of our approach to CCRB on two domains,
namely on the domain of mutagenicity [13], which is a thoroughly investigated
benchmark problem for ILP learners, and on the domain of Quantitative
Structure Activity Relationships (QSARs), another important test-bed for ILP
systems [4,5]. The weak learner is invoked T = 100 times. Although the number
T of iterations can be automatically determined by cross-validation [2], we treat
T as fixed in our experiments.

Mutagenicity: The task is to predict the mutagenicity of a set of small, 
highly structurally heterogeneous molecules (aromatic and heteroaromatic nitro 
compounds). Mutagenic compounds are often known to be carcinogenic and to 
cause damage to DNA. Not all compounds can be empirically tested for mutage- 
nesis, and the prediction of mutagenicity is vital to understanding and predicting 
carcinogenesis. A molecule is described by its atoms, the bonds between them,
and global properties of, and chemical structures present in, the molecule.

Several relational descriptions of the domain are available [12], ranging from
a weakly structured description only involving the atoms and bonds of the
molecules, to a strongly structured description also involving high-level chemical
concepts present in the molecules. We conducted our experiment with C²RIB,
which stands for Constrained Confidence-Rated ILP Boosting, on the strongly
structured description $B_4$, restricted to a subset of 188 so-called regression-friendly
compounds, 125 of which are classified as having positive levels of mutagenicity.
The predictive accuracy is estimated by 10-fold cross-validation, where
we used the same folds as [12] for their experiments with Progol. The accuracy
obtained in our experiment with C²RIB is displayed in Table 2 together with
reference results on the 188-compound dataset using background knowledge $B_4$ and the
sources from which the results are reported. The run-time¹ of C²RIB averages 7
minutes for 100 iterations, as compared to 307 minutes for Progol on all 188
compounds (the run-time for Progol was determined in experiments we performed
on our spare SUNW, Ultra-4, in which we obtained the same accuracy as [13]).

In Table 2 we show only results obtained on the most comprehensive set
of background knowledge, $B_4$, which we have worked with.² As can be seen
from the table, C²RIB performs on par with other ILP learners on the 10-fold
cross-validation data sets of the mutagenicity domain. Moreover, the results are
obtained in reasonable time, and the final hypotheses are fairly comprehensible.
The number of literals in the final hypothesis averages 64
(32 clauses on average, where the body of each clause comprises two
literals on average), as compared to 46 literals on average in the hypotheses
obtained by FOIL as published in [13], and 28 literals on average in the hypotheses
obtained by Progol. A final hypothesis obtained by C²RIB is displayed in Table
3 in the appendix.

¹ All run-times refer to results obtained on a spare SUNW, Ultra-4.

² Additional results have been obtained by other authors on the $B_3$ dataset [12], in
particular by STILL [11] (87 ± 8) and G-Net [1] (91 ± 8).

QSARs: The task is to construct a predictive theory relating the activity of
chemical compounds to their molecular structure. Often, these so-called Quantitative
Structure Activity Relationships cannot be derived solely from physical
theory, and experimental evidence is needed. Again, not all compounds can be
empirically evaluated, and machine learning methods offer a possibility to
investigate QSARs. We conducted our experiments on a 5-fold cross-validation series
of 55 pyrimidine compounds as described in [5]. A pyrimidine compound is
described by chemical groups that can be added at three possible substitution
positions. A chemical group is an atom or a set of structurally connected atoms,
each of which is described by a set of chemical properties. QSAR problems
are in general regression problems, i.e. not a class but real numbers must be
predicted. To get around this problem for ILP, the greater activity relationship
between pairs of compounds is learned. Rules learned for this relationship can
then be employed to rank drugs by their activity. As opposed to [4,5], we restrict
our experiments to the prediction of the greater activity relationship between
pairs of the 55 compounds.

We conducted experiments on the same data sets with the systems Progol [6]
and FOIL [9] in order to obtain reference results on this domain (Table 2). The
predictive accuracy obtained by C²RIB on the 5-fold cross-validation data sets
of the QSARs domain is slightly higher than the ones obtained with the other two
systems (however still within the range of the standard deviations). The run-time of
C²RIB averages 57 minutes for 100 iterations, as compared to 372 and 0.7
minutes for Progol and FOIL, respectively. The number of literals in the final
hypotheses obtained by C²RIB averages 142 (71 clauses on average, where
the body of each clause comprises two literals on average), as compared to 140
and 154 literals on average in the hypotheses obtained by FOIL and Progol,
respectively. The fact that FOIL yields good results in very short run-times
suggests investigating why FOIL's heuristics are so successful and how elements
of FOIL could be incorporated in our weak learner.



6 Related Work 

The work described in this paper is based on recent research in the area of
propositional boosting, and centrally builds on Cohen and Singer's [2] approach
to constrained confidence-rated boosting. However, the properties of the weak
learner embedded in the boosting framework, namely the declarative bias
employed in form of a foreign literal restriction, the application of look-ahead, and
the preclusion of hypotheses negatively correlated with the positive class,
distinguish our work from the approach of [2].

Table 2. Accuracy, standard deviation, average run-time and average number of literals in the
final hypotheses on the 188-compound B4 mutagenicity dataset and the QSARs dataset

                              C²RIB        FOIL              Fors             Progol
Mutagenicity
  Accuracy ± StdDev           88.0 ± 6.0   82.0 ± 3.0 [13]   89.0 ± 6.0 [3]   88.0 ± 2.0 [13]
  Avg. run-time (minutes)     7            n/a               n/a              307
  Avg. number of literals     64           46                n/a              28
QSARs
  Accuracy ± StdDev           83.2 ± 3.0   82.9 ± 2.7        n/a              79.8 ± 3.7
  Avg. run-time (minutes)     57           0.7               n/a              372
  Avg. number of literals     142          140               n/a              154

The ILP work probably most closely related to our approach is that of Quinlan
[8]. However, Quinlan uses, in conjunction with AdaBoost.M1, a standard ILP
learner (FFOIL), so that the boosted ILP learner can be expected to produce
fairly large run-times due to the high effort already expended by FFOIL. Moreover,
FFOIL itself, as the first-order learner embedded into the boosting
framework, generates weak hypotheses each of which comprises a set of clauses. Thus, the
resulting strong hypothesis is apt to be highly complex. Lastly, this approach
works, due to the absence of a confidence measure, with equal voting weights
for all weak hypotheses, and, instead of a probability distribution over the
training instances, a re-sampling procedure is used to approximate the weights of the
examples.

The weak learner employed in our approach is based on the refinement operator
and declarative bias in form of foreign links introduced in Midos [14].
Additionally, a limited form of look-ahead has been employed in order to avoid the
shortsightedness problem with respect to existential variables in the hypotheses
generated by the greedy weak relational learner.

7 Conclusion 

In this paper, we have presented an approach to boosting in first-order learning.
Our approach, which we have termed constrained confidence-rated boosting,
builds on recent advances in the area of propositional boosting; in particular,
it adapts the approach of Cohen and Singer [2] to the first-order domain. The
primary advantage of constrained confidence-rated boosting is that the resulting
rule sets are restricted to a much simpler and more understandable format than
the one produced by unconstrained versions, e.g. AdaBoost.M1, as used
in the only prior work on boosting in ILP by Quinlan [8]. On two standard
benchmark problems, we have shown that by using an appropriate first-order
weak learner with look-ahead, it is possible to design a learning system that
produces results that are comparable to much more powerful ILP learners both
in accuracy and in comprehensibility while achieving short run-times due to the
simplicity of the weak learner.

These encouraging results need to be substantiated in future work, in particular
in the direction of examining other points in the power/run-time trade-off
of the weak learner. The current weak learner has short run-times and already
reaches results comparable to other non-boosted systems, but it appears possible
to make this weak learner slightly more powerful by adding in more of the
standard elements of “full-blown” ILP learners. While this would certainly slow
down the system, it would be an interesting goal of further research to determine
exactly the right balance between speed and accuracy of the weak learner.

This work was partially supported by DFG (German Science Foundation), 
project FOR345/1-1TP6. 

References 

1. C. Anglano, A. Giordana, G. Lo Bello, and L. Saitta. An experimental evaluation
of coevolutive concept learning. In J. Shavlik, editor, Proceedings of the 15th ICML,
1998.

2. W. Cohen and Y. Singer. A Simple, Fast, and Effective Rule Learner. Proc. of 16th 
National Conference on Artificial Intelligence, 1999. 

3. A. Karalic. First Order Regression. PhD thesis, University of Ljubljana, Faculty of
Computer Science, Ljubljana, Slovenia, 1995.

4. R.D. King, S. Muggleton, R.A. Lewis, and M.J.E. Sternberg. Drug design by ma- 
chine learning: The use of inductive logic programming to model the structure ac- 
tivity relationships of trimethoprim analogues binding to dihydrofolate reductase. 
Proceedings of the National Academy of Sciences of the United States of America 
89(23):11322-11326, 1992. 

5. R.D. King, A. Srinivasan, and M. Sternberg. Relating chemical activity to structure:
An examination of ILP successes. New Generation Computing, Special issue on
Inductive Logic Programming 13(3-4):411-434, 1995.

6. S. Muggleton. Inverse Entailment and Progol. New Generation Computing, 13:245- 
286, 1995. 

7. J.R. Quinlan. Bagging, boosting, and C4.5. In Proc. of 13th National Conference
on Artificial Intelligence, 1996.

8. J.R. Quinlan. Boosting First-Order Learning. Algorithmic Learning Theory, 1996. 

9. J.R. Quinlan and R. M. Cameron-Jones. FOIL: A Midterm Report. In P. Brazdil,
editor, Proceedings of the 6th European Conference on Machine Learning, volume
667, pages 3-20. Springer-Verlag, 1993.

10. R.E. Schapire. Theoretical views of boosting and applications. In Proceedings of 
the 10th International Conference on Algorithmic Learning Theory, 1999. 

11. M. Sebag and C. Rouveirol. Resource-bounded Relational Reasoning: Induction
and Deduction through Stochastic Matching. Machine Learning, 38:41-62, 2000.

12. A. Srinivasan, S. Muggleton, and R. King. Comparing the use of background knowl- 
edge by inductive logic programming systems. Proceedings of the 5th International 
Workshop on Inductive Logic Programming, 1995. 

13. A. Srinivasan, S. Muggleton, M.J.E. Sternberg, and R.D. King. Theories for muta- 
genicity: A study in first-order and feature-based induction. Artificial Intelligence, 
85:277-299, 1996. 

14. S. Wrobel. An algorithm for multi-relational discovery of subgroups. In J. Komorowski
and J. Zytkow, editors, Principles of Data Mining and Knowledge Discovery:
First European Symposium - Proceedings of the PKDD-97, pages 78-87, 1997.







Susanne Hoche and Stefan Wrobel 



A Sample Output from C²RIB

Table 3. A strong hypothesis obtained from C²RIB

DEFAULT RULE:
active(A). [-1.40575]

POSITIVE RULES:
active(A) ← logp(A,C),C>2.0,logp(A,D),D<4.0. [0.00082336]
active(A) ← lumo(A,C),C> -2.0,lumo(A,D),D< -1.2. [0.0210132]
active(A) ← logp(A,C),C>2.0. [0.115733]
active(A) ← lumo(A,C),C> -2.0,logp(A,D),D<3.0,atm(A,E,F,29,G). [0.175073]
active(A) ← atm(A,C,D,35,E). [0.176489]
active(A) ← atm(A,C,D,1,E). [0.197106]
active(A) ← ringSize5(A,C). [0.215675]
active(A) ← atm(A,C,D,27,E). [0.231689]
active(A) ← lumo(A,C),C< -1.2. [0.283592]
active(A) ← lumo(A,C),C> -2.0,atm(A,D,E,29,F). [0.355777]
active(A) ← logp(A,C),C>5.0. [0.470995]
active(A) ← bond(A,C,D,5). [0.582912]
active(A) ← atm(A,C,D,26,E),atm(A,F,G,1,H),lumo(A,I),I< -1.2. [0.584057]
active(A) ← atm(A,C,cl,D,E),bond(F,C,G,H). [0.763684]
active(A) ← atm(A,C,D,26,E),logp(A,F),F>3.0. [0.778605]
active(A) ← atm(A,C,D,27,E),logp(A,F),F>2.0,logp(A,G),G<3.0. [0.832673]
active(A) ← atm(A,C,D,27,E),ringSize5(A,F). [0.925553]
active(A) ← atm(A,C,D,230,E). [0.977438]
active(A) ← logp(A,C),C>3.0,ringSize5(A,D). [1.00485]
active(A) ← atm(A,C,D,16,E). [1.01437]
active(A) ← atm(A,C,D,32,E),bond(F,C,G,2). [1.1001]
active(A) ← carbon5aromaticRing(A,C). [1.4434]
active(A) ← bond(A,C,D,3). [1.46341]
active(A) ← lumo(A,C),C< -2.0. [1.64408]
active(A) ← ringSize5(A,C),logp(A,D),D>4.0. [1.69492]
active(A) ← atm(A,C,D,28,E). [1.69956]
active(A) ← anthracene(A,C). [2.21461]
active(A) ← carbon6Ring(A,C). [3.06628]
active(A) ← phenanthrene(A,C). [3.55481]
Abstract. This paper shows a sound and complete method for inverse
entailment in inductive logic programming. We show that inverse entailment
can be computed with a resolution method for consequence-finding.
In comparison with previous work, induction via consequence-finding is
sound and complete for finding hypotheses from full clausal theories,
and can be used for inducing not only definite clauses but also non-Horn
clauses and integrity constraints. We also compare induction and
abduction from the viewpoint of consequence-finding, and clarify the
relationship and difference between the two.



1 Introduction 

Both induction and abduction are forms of ampliative reasoning, and share the logic
of seeking hypotheses that account for given observations and examples. That is, given
a background theory $B$ and observations (or positive examples) $E$, the common task of
induction and abduction is to find hypotheses $H$ such that

\[ B \wedge H \models E, \tag{1} \]

where $B \wedge H$ is consistent [IHI5IXI14J . While the logic is in common, they differ
in the usage in applications. According to Peirce, abduction infers a cause of an
observation, and can infer something quite different from what is observed. On
the other hand, induction infers something to be true through generalization of
a number of cases of which the same thing is true. The relation, difference,
similarity, and interaction between abduction and induction are extensively studied
by authors in [B].

Compared with automated abduction, one of the major drawbacks of automated
induction is that computation of inductive hypotheses requires a large
amount of search that is highly expensive. General mechanisms to construct
hypotheses rely on refinement of current hypotheses, which offers many alternative
choices unless good heuristics are incorporated in the search. We thus need a logically
principled way to compute inductive hypotheses. One such promising method
to compute hypotheses $H$ in (1) is based on inverse entailment, which transforms
the equation (1) into

\[ B \wedge \neg E \models \neg H. \tag{2} \]

C. Rouveirol and M. Sebag (Eds.): ILP 2001, LNAI 2157, pp. 65-79, 2001.

© Springer-Verlag Berlin Heidelberg 2001
The equation (2) says that, given $B$ and $E$, any hypothesis $H$ deductively follows
from $B \wedge \neg E$ in its negated form. For example, given $B_1 = \{human(s)\}$ and $E_1 =
\{mortal(s)\}$, $H_1 = \{\forall x (human(x) \supset mortal(x))\}$ satisfies (1). In fact, $B_1 \wedge
\neg E_1 = \{human(s), \neg mortal(s)\} \models \exists x (human(x) \wedge \neg mortal(x)) = \neg H_1$. The
equation (2) is seen in literature, e.g., El for abduction and m for induction.

While the equation (2) is useful for computing abductive explanations of
observations in abduction, it is more difficult to apply it to compute inductive
hypotheses. In abduction, without loss of generality, $E$ is written as a ground
atom, and each $H$ is usually assumed to be a conjunction of literals. These
conditions make abductive computation relatively easy, and consequence-finding
algorithms HUmE! can be directly applied.

In induction, however, $E$ can be clauses and $H$ is usually a general rule.
Universally quantified rules for $H$ cannot be easily obtained from the negation
of consequences of $B \wedge \neg E$. Then, Muggleton m considered the so-called bottom
clause:

\[ \bot(E, B) = \{\neg L \mid L \text{ is a literal and } B \wedge \neg E \models L\}. \]

A hypothesis $H$ is then constructed by generalizing a sub-clause of $\bot(E, B)$, i.e.,

\[ H \models \bot(E, B). \]

While this method is adopted in Progol, it is incomplete for finding hypotheses
satisfying (1). Then, several improvements have been reported to make
inverse entailment complete mm\ or to characterize inverse entailment precisely.
However, such improved inductive procedures are not very simple
when compared with abductive computation. More seriously, some improved
procedures are unsound even though they are complete. Another difficulty in
the previous inductive methods lies in the following facts: (i) each constructed hypothesis
in $H$ is usually assumed to be a Horn clause, (ii) the example $E$ is given as a
single Horn clause, and (iii) the background theory $B$ is a set of Horn clauses.
Finding full clausal hypotheses from full clausal theories has not received
much attention so far.

In this paper, we propose a simple, yet powerful method to handle inverse
entailment (2) for computing inductive hypotheses. Unlike previous methods based
on the bottom clause, we do not restrict the consequences of $B \wedge \neg E$ to literals,
but consider the characteristic clauses of $B \wedge \neg E$, which were originally proposed
for AI applications (including abduction) of consequence-finding |TT]. Using our
method, sound and complete hypothesis-finding from full clausal theories can be
realized, and not only definite clauses but also non-Horn clauses and integrity
constraints can be constructed as $H$. In this way, inductive algorithms can be
designed with deductive procedures, which reduce the search space as much as
possible, as in computing abduction. In this paper, we also clarify the relationship
and difference between abductive and inductive computation.

This paper is organized as follows. Section 2 introduces the theoretical
background in this paper. Section 3 reviews a consequence-finding method for
abduction. Section 4 provides the basic idea called CF-induction to construct inductive
hypotheses using a consequence-finding method. Section 5 compares induction
with abduction in the context of consequence-finding. Section 6 discusses related
work, and Section 7 is the conclusion. The proof of the main theorem is given
in the appendix.

2 Background 

2.1 Inductive Logic Programming 

Here, we review the terminology of inductive logic programming (ILP). A clause
is a disjunction of literals, and is often denoted as the set of its disjuncts. A
clause $\{A_1, \ldots, A_m, \neg B_1, \ldots, \neg B_n\}$, where each $A_i$, $B_j$ is an atom, is also written
as $B_1 \wedge \cdots \wedge B_n \supset A_1 \vee \cdots \vee A_m$. Any variable in a clause is assumed to be
universally quantified at the front. A definite clause is a clause which contains
only one positive literal. A positive (negative) clause is a clause whose disjuncts
are all positive (negative) literals. A negative clause is often called an integrity
constraint. A Horn clause is a definite clause or negative clause; otherwise it is
non-Horn. A clausal theory $\Sigma$ is a finite set of clauses. A clausal theory is full if
it contains non-Horn clauses.
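
These definitions translate directly into a small classification sketch. The `Clause` type with separate sets of positive and negative atoms is a hypothetical encoding chosen for illustration, and the empty clause is treated here as (vacuously) negative.

```python
from typing import FrozenSet, Iterable, NamedTuple


class Clause(NamedTuple):
    pos: FrozenSet[str]  # atoms occurring positively (the A_i)
    neg: FrozenSet[str]  # atoms occurring negatively (the B_j)


def is_definite(c: Clause) -> bool:
    """Exactly one positive literal."""
    return len(c.pos) == 1


def is_negative(c: Clause) -> bool:
    """No positive literal, i.e. an integrity constraint."""
    return not c.pos


def is_horn(c: Clause) -> bool:
    """Definite or negative; anything else is non-Horn."""
    return is_definite(c) or is_negative(c)


def is_full(theory: Iterable[Clause]) -> bool:
    """A clausal theory is full if it contains a non-Horn clause."""
    return any(not is_horn(c) for c in theory)


rule = Clause(frozenset({"mortal(X)"}), frozenset({"human(X)"}))
constraint = Clause(frozenset(), frozenset({"p(X)", "q(X)"}))
disjunction = Clause(frozenset({"p(X)", "q(X)"}), frozenset({"r(X)"}))

print(is_horn(rule), is_horn(constraint), is_horn(disjunction))
```

In the toy data, the rule and the integrity constraint are Horn, while the clause with two positive literals makes any theory containing it full.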

A (universal) conjunctive normal form (CNF) formula is a conjunction of
clauses, and a disjunctive normal form (DNF) formula is a disjunction of
conjunctions of literals. A clausal theory $\Sigma$ is identified with the CNF formula
that is the conjunction of all clauses in $\Sigma$. We define the complement of a
clausal theory, $\Sigma = C_1 \wedge \cdots \wedge C_k$ where each $C_i$ is a clause, as the DNF
formula $\neg C_1\sigma_1 \vee \cdots \vee \neg C_k\sigma_k$, where $\neg C_i = B_1 \wedge \cdots \wedge B_n \wedge \neg A_1 \wedge \cdots \wedge \neg A_m$ for
$C_i = (B_1 \wedge \cdots \wedge B_n \supset A_1 \vee \cdots \vee A_m)$, and $\sigma_i$ is a substitution which replaces
each variable $x$ in $C_i$ with a Skolem constant. This replacement of variables
reflects the fact that each variable in $\neg C_i$ is existentially quantified at the front.
Since there is no ambiguity, we write the complement of $\Sigma$ as $\neg\Sigma$.

Let $C$ and $D$ be two clauses. $C$ subsumes $D$ if there is a substitution $\theta$ such
that $C\theta \subseteq D$. $C$ properly subsumes $D$ if $C$ subsumes $D$ but $D$ does not subsume
$C$. For a clausal theory $\Sigma$, $\mu\Sigma$ denotes the set of clauses in $\Sigma$ not properly
subsumed by any clause in $\Sigma$.
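
For ground clauses no substitution is needed, so subsumption reduces to the subset relation and proper subsumption to the proper subset relation. Under that simplifying assumption (which does not cover clauses with variables), the operator that removes properly subsumed clauses can be sketched as follows; the function name `mu` and the literal encoding are hypothetical.

```python
from typing import FrozenSet, Set

GroundClause = FrozenSet[str]  # a ground clause as a set of literal strings


def mu(theory: Set[GroundClause]) -> Set[GroundClause]:
    """Keep only the clauses not properly subsumed by another clause of the theory."""
    # For ground clauses, C properly subsumes D exactly when C is a proper subset of D.
    return {d for d in theory if not any(c < d for c in theory)}


theory = {
    frozenset({"p"}),
    frozenset({"p", "q"}),   # properly subsumed by {p}
    frozenset({"r", "s"}),
}
print(mu(theory))
```

The clause containing both literals is dropped because the unit clause is a proper subset of it, while the remaining two clauses are incomparable and both survive.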

Let $B$, $E$, and $H$ be clausal theories, representing a background theory,
(positive) examples, and hypotheses, respectively. The most popular formalization
of concept-learning is learning from entailment (or explanatory induction), in
which the task is: given $B$ and $E$, find $H$ such that $B \wedge H \models E$ and $B \wedge H$ is
consistent. Note that we do not consider negative examples in this paper.

2.2 Consequence-Finding 

For a clausal theory $\Sigma$, a consequence of $\Sigma$ is a clause entailed by $\Sigma$. We denote
by $Th(\Sigma)$ the set of all consequences of $\Sigma$. The consequence-finding problem was
first addressed by Lee in the context of the resolution principle. Lee proved
that, for any consequence $D$ of $\Sigma$, the resolution principle can derive a clause $C$
from $\Sigma$ such that $C$ entails $D$. In this sense, the resolution principle is said to be
complete for consequence-finding. In Lee’s theorem, “C entails D” can be replaced
with “C subsumes D”. Hence, the consequences of $\Sigma$ that are derived by the
resolution principle include $\mu Th(\Sigma)$. The notion of consequence-finding is used
as the theoretical background for discussing the completeness of ILP systems.

By extending the notion of consequence-finding, Inoue defined
characteristic clauses to represent “interesting” clauses for a given problem. Each
characteristic clause is constructed over a sub-vocabulary of the representation
language called a production field. In this paper, for the sake of simplicity, a
production field $\mathcal{P}$ is defined as a set of distinguished literals. Let $Th_{\mathcal{P}}(\Sigma)$ be
the clauses in $Th(\Sigma)$ all of whose literals belong to $\mathcal{P}$. Then, the characteristic
clauses of $\Sigma$ with respect to $\mathcal{P}$ are defined as:

$Carc(\Sigma, \mathcal{P}) = \mu\, Th_{\mathcal{P}}(\Sigma)$.

Here, we do not include any tautology $\neg L \vee L$ ($\equiv True$) in $Carc(\Sigma, \mathcal{P})$ even when
both $L$ and $\neg L$ belong to $\mathcal{P}$. Note that the empty clause $\Box$ is the unique clause
in $Carc(\Sigma, \mathcal{P})$ if and only if $\Sigma$ is unsatisfiable. This means that proof-finding is a
special case of consequence-finding. In the propositional case, each characteristic
clause of $\Sigma$ is a prime implicate of $\Sigma$.

When a new clause $C$ is added to a clausal theory $\Sigma$, some consequences are
newly derived with this new information. Such a new and “interesting” clause is
called a “new” characteristic clause. Formally, the new characteristic clauses of
$C$ with respect to $\Sigma$ and $\mathcal{P}$ are:

$NewCarc(\Sigma, C, \mathcal{P}) = \mu\,[\,Th_{\mathcal{P}}(\Sigma \wedge C) - Th(\Sigma)\,]$
$\qquad = Carc(\Sigma \wedge C, \mathcal{P}) - Carc(\Sigma, \mathcal{P})$.

When a new formula is not a single clause but a CNF formula $F = C_1 \wedge \cdots \wedge C_m$,
where each $C_i$ is a clause, $NewCarc(\Sigma, F, \mathcal{P})$ can be decomposed into $m$ NewCarc
operations, in each of which the added new formula is a single clause:

$NewCarc(\Sigma, F, \mathcal{P}) = \mu\,[\,\bigcup_{i=1}^{m} NewCarc(\Sigma_i, C_i, \mathcal{P})\,]$,   (3)

where $\Sigma_1 = \Sigma$, and $\Sigma_{i+1} = \Sigma_i \wedge C_i$, for $i = 1, \ldots, m-1$. This incremental
computation can be applied to get the characteristic clauses of $\Sigma$ with respect
to $\mathcal{P}$ as follows.

$Carc(\Sigma, \mathcal{P}) = NewCarc(True, \Sigma, \mathcal{P})$.   (4)

Several procedures have been proposed to compute (new) characteristic
clauses. For example, SOL resolution is an extension of the Model Elimination
(ME) calculus to which the Skip rule is introduced. In computing
$NewCarc(\Sigma, C, \mathcal{P})$, SOL resolution treats a newly added clause $C$ as the top
clause input to ME, and derives those consequences relevant to $C$ directly. With
the Skip rule, SOL resolution focuses on deriving only those consequences
belonging to the production field $\mathcal{P}$. Various pruning methods are also introduced
to enhance the efficiency of SOL resolution in a connection-tableau format.
Instead of ME, SFK resolution [4] is a variant of ordered resolution, which is
enhanced with the Skip rule for finding characteristic clauses. An extensive
survey of consequence-finding algorithms in propositional logic is given by Marquis.



3 Abduction as Consequence-Finding 

Abduction is elegantly characterized by consequence-finding as follows. We here
denote the set of all literals in the representation language by $\mathcal{L}$, and a set $\Gamma$ of
candidate hypotheses (abductive bias) is defined as a subset of $\mathcal{L}$. Any subset $H$
of $\Gamma$ is identified with the conjunction of all elements in $H$. Also, for any set $T$
of formulas, $\overline{T}$ represents the set of formulas obtained by negating every formula
in $T$, i.e., $\overline{T} = \{\, \neg C \mid C \in T \,\}$.

Let $E_1, \ldots, E_n$ be a finite number of observations, and suppose that they
are all literals. We want to explain the observations $E = E_1 \wedge \cdots \wedge E_n$ from an
abductive theory $(B, \Gamma)$, where $B$ is a clausal theory representing a background
theory and $\Gamma$ is a set of ground literals representing an abductive bias. Then,
$H = H_1 \wedge \cdots \wedge H_k$ is an (abductive) explanation of $E$ from $(B, \Gamma)$ if:

1. $B \wedge (H_1 \wedge \cdots \wedge H_k) \models E_1 \wedge \cdots \wedge E_n$,

2. $B \wedge (H_1 \wedge \cdots \wedge H_k)$ is consistent,

3. Each $H_i$ is an element of $\Gamma$.

These are equivalent to the following three conditions:

1'. $B \wedge (\neg E_1 \vee \cdots \vee \neg E_n) \models \neg H_1 \vee \cdots \vee \neg H_k$,

2'. $B \not\models \neg H_1 \vee \cdots \vee \neg H_k$,

3'. Each $\neg H_i$ is an element of $\overline{\Gamma}$.

By 1', a clause derived from the clausal theory $B \wedge \neg E$ is the negation of an
explanation of $E$ from $(B, \Gamma)$, and this computation can be done as automated
deduction over clauses in the manner of “inverse entailment”. By 2', such a derived
clause must not be a consequence of $B$ before adding $\neg E$. By 3', every literal
appearing in such a clause must belong to $\overline{\Gamma}$. Moreover, $H$ is a minimal
explanation from $(B, \Gamma)$ if and only if $\neg H$ is a minimal consequence from $B \wedge \neg E$.
Therefore, we obtain the following result.

Theorem 3.1 Let $(B, \Gamma)$ be an abductive theory. The set of minimal
explanations of an observation $E$ from $(B, \Gamma)$ is:

$\overline{NewCarc(B, \neg E, \mathcal{P})}$,

where the production field $\mathcal{P}$ is $\overline{\Gamma}$.
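As an illustration of this characterization in the propositional case, the sketch below computes minimal explanations by negating the new characteristic clauses of $\neg E$ with respect to $B$, using the production field $\overline{\Gamma}$. The rain/sprinkler theory is a hypothetical example, and the naive resolution-based helper functions are assumptions introduced here; they stand in for a proper consequence-finding procedure.

```python
from itertools import combinations


def neg(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit


def saturate(sigma):
    """Close a ground clause set under resolution, skipping tautologies."""
    taut = lambda c: any(neg(lit) in c for lit in c)
    known = {c for c in map(frozenset, sigma) if not taut(c)}
    while True:
        new = {(c - {lit}) | (d - {neg(lit)})
               for c, d in combinations(known, 2)
               for lit in c if neg(lit) in d}
        new = {r for r in new if not taut(r)} - known
        if not new:
            return known
        known |= new


def newcarc(sigma, formula, field):
    old = saturate(sigma)
    added = {c for c in saturate(list(sigma) + list(formula)) if c <= field}
    fresh = {c for c in added if not any(o <= c for o in old)}
    return {d for d in fresh if not any(c < d for c in fresh)}


def minimal_explanations(background, observation, bias):
    """Negate each new characteristic clause of the negated observation.

    observation: a set of ground literals, read as a conjunction;
    bias: the set Gamma of literals allowed in explanations.
    """
    negated_obs = [{neg(lit) for lit in observation}]   # one clause: ~E1 v ... v ~En
    field = {neg(lit) for lit in bias}                  # production field: negated Gamma
    return {frozenset(neg(lit) for lit in clause)
            for clause in newcarc(background, negated_obs, field)}


# Hypothetical background theory: rain -> wet, sprinkler -> wet.
B = [{"~rain", "wet"}, {"~sprinkler", "wet"}]
explanations = minimal_explanations(B, {"wet"}, {"rain", "sprinkler"})
print(explanations)
```

Each returned set is read as a conjunction of literals, so the observation `wet` has the two minimal explanations `rain` and `sprinkler` here.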

In the above setting, $E$ is assumed to be a conjunction of literals. Extending
the form of each example $E_i$ to a clause, let $E = E_1 \wedge \cdots \wedge E_n$ be a CNF formula,
where each $E_i$ is a clause. Then, $\neg E$ is a DNF formula. By converting $\neg E$ from
DNF into the CNF formula $F$, $NewCarc(B, F, \mathcal{P})$ can be computed by (3).

In Theorem 3.1, explanations obtained by a consequence-finding procedure are
not necessarily ground and can contain variables. In implementing
resolution-based abductive procedures, however, each variable in the CNF formula $E$ is
replaced with a new constant in the complement $\neg E$ through Skolemization.
Then, to get a universally quantified explanation by negating each new
characteristic clause containing Skolem constants, we need to apply the reverse
Skolemization algorithm. For example, if $\neg P(x, sk_y, u, sk_v)$ is a new
characteristic clause where $sk_y$, $sk_v$ are Skolem constants, we get the explanation
$\forall y \forall v \exists x \exists u\, P(x, y, u, v)$ by reverse Skolemization.

4 Induction as Consequence-Finding 

In this section, we characterize explanatory induction by consequence-finding.
Suppose that we are given a background theory $B$ and examples $E$, both of which
are clausal theories (or CNF) possibly containing non-Horn clauses. Recall that
explanatory induction seeks a clausal theory $H$ such that:

$B \wedge H \models E$,

$B \wedge H$ is consistent.

These two are equivalent to

$B \wedge \neg E \models \neg H$,   (5)

$B \not\models \neg H$.   (6)
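The equivalence follows from the usual refutation argument. The short derivation below is added for convenience and uses only the definitions given so far.

```latex
\begin{align*}
B \wedge H \models E
  &\iff B \wedge H \wedge \neg E \text{ is unsatisfiable} \\
  &\iff B \wedge \neg E \models \neg H, \\[4pt]
B \wedge H \text{ is consistent}
  &\iff B \not\models \neg H .
\end{align*}
```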

Like inverse entailment, we are interested in some formulas derived from $B \wedge \neg E$
that are not derived from $B$ alone. Here, instead of the bottom clause $\bot(B, E)$,
we consider some clausal theory $CC(B, E)$. Then, the equation (5) can
be written as

$B \wedge \neg E \models CC(B, E)$,   (7)

$CC(B, E) \models \neg H$.   (8)

The latter (8) is also written as

$H \models \neg CC(B, E)$.   (9)

Also, by (6) and (8), we have

$B \not\models CC(B, E)$.   (10)

By (7), $CC(B, E)$ is obtained by computing the characteristic clauses of $B \wedge \neg E$
because any other consequence of $B \wedge \neg E$ can be obtained by constructing a
clause that is subsumed by a characteristic clause. Hence,

$Carc(B \wedge \neg E, \mathcal{P}) \models CC(B, E)$,   (11)

where the production field $\mathcal{P}$ ($\subseteq \mathcal{L}$) is defined as some set of literals reflecting an
inductive bias. When no inductive bias is considered, $\mathcal{P}$ is just set to $\mathcal{L}$, which
is the set of all literals in the first-order language. The other requirement for
$CC(B, E)$ is the equation (10), which is satisfied if at least one of the clauses
in $CC(B, E)$ is not a consequence of $B$; otherwise, $CC(B, E)$ is entailed by $B$.
This is realized by including a clause from $NewCarc(B, \neg E, \mathcal{P})$ in $CC(B, E)$.

In constructing hypotheses $H$ from the clausal theory $CC(B, E)$, notice that
$\neg CC(B, E)$ is entailed by $H$ in (9). Since $\neg CC(B, E)$ is DNF, we convert it
into the CNF formula $F$, i.e., $F \equiv \neg CC(B, E)$. Then, $H$ is constructed as
a clausal theory which entails $F$, i.e., $H \models F$. There are several methods to
perform such an inverse of entailment in ILP. Such a procedure to construct
a new clausal theory that entails a given clausal theory is called a generalizer.
In some cases, reverse Skolemization in abduction works as a generalizer,
but there exist other techniques to generalize clauses such as anti-instantiation
(i.e., replacement of terms with variables), dropping literals from clauses, inverse
resolution, addition of clauses, and Plotkin’s least generalization of multiple
clauses. Note that applying an arbitrary generalizer to $F$ may cause an inconsistency
of $H$ with $B$. To ensure that $B \wedge H$ is consistent, the clauses of $H$ must keep
those literals that are generalizations of the complement of at least one clause
from $NewCarc(B, \neg E, \mathcal{P})$.

Now, the whole algorithm to construct inductive hypotheses is as follows. 

Definition 4.1 Let $B$ and $E$ be clausal theories. A clausal theory $H$ is derived
by a CF-induction from $B$ and $E$ if $H$ is constructed as follows.

Step 1. Compute $Carc(B \wedge \neg E, \mathcal{P})$;

Step 2. Construct $CC(B, E) = C_1 \wedge \cdots \wedge C_m$, where each $C_i$ is a clause
satisfying the conditions:

(a) Each $C_i$ is an instance of a clause in $Carc(B \wedge \neg E, \mathcal{P})$;

(b) At least one $C_i$ is an instance of a clause from $NewCarc(B, \neg E, \mathcal{P})$;

Step 3. Convert $\neg CC(B, E)$ into the CNF formula $F$;

Step 4. $H$ is obtained by applying a generalizer to $F$ under the constraint that
$B \wedge H$ is consistent.
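Step 3 is a purely syntactic conversion, which can be sketched for ground clauses as follows. The negation of a conjunction of clauses is a disjunction of conjunctions of literals, and distributing disjunction over conjunction yields the CNF formula $F$. The function name `complement_to_cnf` and the string encoding of literals are assumptions made for this illustration; the input is the ground choice of $CC(B_3, E_3)$ that the paper uses in Example 4.2.

```python
from itertools import product


def neg(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit


def complement_to_cnf(cc):
    """CNF of the negation of cc, where cc is a list of ground clauses.

    not (C1 and ... and Cm) is (not C1) or ... or (not Cm), and each
    not Ci is a conjunction of literals. Picking one literal from every
    conjunction gives one clause of the CNF.
    """
    conjunctions = [[neg(lit) for lit in sorted(clause)] for clause in cc]
    clauses = {frozenset(choice) for choice in product(*conjunctions)}
    # Drop tautologies and clauses that are properly subsumed by another one.
    clauses = {c for c in clauses if not any(neg(lit) in c for lit in c)}
    return {c for c in clauses if not any(d < c for d in clauses)}


cc3 = [
    {"even(0)"},
    {"~odd(s(0))", "even(s(s(0)))"},
    {"~odd(s(s(s(0))))"},
]
f3 = complement_to_cnf(cc3)
for clause in f3:
    print(" v ".join(sorted(clause)))
```

The number of clauses produced this way can grow exponentially with the size of $CC(B, E)$, so this direct distribution is only meant to make the step concrete.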



Example 4.1 The following theory is a variant of an example in [17], and is often
used to illustrate how the bottom clause is used in inverse entailment [21,22].
Consider

$B_2 = (cat(x) \supset pet(x)) \wedge (small(x) \wedge fluffy(x) \wedge pet(x) \supset cuddly\_pet(x))$,

$E_2 = (fluffy(x) \wedge cat(x) \supset cuddly\_pet(x))$.

^ (Footnote to the definition of a generalizer.) If “entailment” is replaced with
“subsumption” here, the completeness in Theorem 4.1 does not precisely hold.

^ (Footnote to Step 1 of Definition 4.1.) Since the number of characteristic clauses
may be large or infinite in general, this step should be interleaved on demand with
construction of each $C_i$ at Step 2 in practice.

Then, the complement of $E_2$ is

$\neg E_2 = fluffy(sk_x) \wedge cat(sk_x) \wedge \neg cuddly\_pet(sk_x)$,

and $NewCarc(B_2, \neg E_2, \mathcal{L})$ is

$\neg E_2 \wedge pet(sk_x) \wedge \neg small(sk_x)$.

Let $CC(B_2, E_2) = NewCarc(B_2, \neg E_2, \mathcal{L})$. In this case, $F_2 = \neg CC(B_2, E_2) =
\bot(B_2, E_2)$ holds. By applying reverse Skolemization to $F_2$, we get the hypothesis
$H_2$:

$fluffy(x) \wedge cat(x) \wedge pet(x) \supset cuddly\_pet(x) \vee small(x)$.

While in the above cited reference the subclause

$fluffy(x) \wedge cat(x) \supset small(x)$

is adopted for a definite clause, $H_2$ is the most-specific hypothesis in the sense
of [17].



Example 4.2 This example illustrates the incompleteness of inverse entailment
based on the bottom clause. Consider the background theory and
the example:

$B_3 = even(0) \wedge (\neg odd(x) \vee even(s(x)))$,

$E_3 = odd(s(s(s(0))))$.

Then, $Carc(B_3 \wedge \neg E_3, \mathcal{L}) = B_3 \wedge \neg E_3$. Suppose that $CC(B_3, E_3)$ is chosen as:

$even(0) \wedge (\neg odd(s(0)) \vee even(s(s(0)))) \wedge \neg odd(s(s(s(0))))$,

where the second clause is an instance of the latter clause in $B_3$, and the third
clause belongs to $NewCarc(B_3, \neg E_3, \mathcal{L})$. By converting $\neg CC(B_3, E_3)$ into CNF,
$F_3$ consists of the clauses:

$\neg even(0) \vee odd(s(0)) \vee odd(s(s(s(0))))$,

$\neg even(0) \vee \neg even(s(s(0))) \vee odd(s(s(s(0))))$.

Considering the single clause:

$H_3 = \neg even(x) \vee odd(s(x))$,

$H_3$ subsumes both clauses in $F_3$ (take $x = 0$ for the first and $x = s(s(0))$ for the
second), so $H_3$ is a hypothesis. On the other hand, the bottom clause is

$\bot(B_3, E_3) = \neg even(0) \vee odd(s(s(s(0))))$,

from which $H_3$ cannot be obtained by any generalizer. In fact, $H_3 \not\models \bot(B_3, E_3)$.

Below is the correctness result for clausal theories derived using CF-induction.
The result implies not only the completeness but also the soundness of CF-induction.

Theorem 4.1 Let $B$, $E$ and $H$ be clausal theories. $H$ is derived by a
CF-induction from $B$ and $E$ if and only if $B \wedge H \models E$ and $B \wedge H$ is consistent.

In Theorem 4.1, both $B$ and $E$ may contain non-Horn clauses and integrity
constraints. Also, the derived hypotheses $H$ may be non-Horn. This result is an
answer to the open question posed by [18], as to whether a generalization of
inverse entailment would be complete for arbitrary clausal background theories.

5 Abduction vs. Induction 

CF-induction is realized by abductive computation. In fact, computing
$Carc(B \wedge \neg E, \mathcal{P})$ at Step 1 can be implemented by calling NewCarc operations
incrementally in (3) and (4), each of which can be regarded as computing abduction by
Theorem 3.1.

Conversely, computing abduction is regarded as a special case of CF-induction.

Theorem 5.1 Let $(B, \Gamma)$ be an abductive theory. A conjunction $H$ of literals is a
minimal explanation of an observation $E$ from $(B, \Gamma)$ if and only if $H$ is derived
by a CF-induction from $B$ and $E$ in which the size of $CC(B, E)$ at Step 2 is 1
($m = 1$) and reverse Skolemization is used as the generalizer at Step 4.

The set of all minimal explanations is characterized by Theorem 3.1 and can
also be obtained by slightly modifying CF-induction. Namely, every clause of
$CC(B, E)$ is taken from $NewCarc(B, \neg E, \mathcal{P})$ at Step 2, and we do not have to
convert $\neg CC(B, E)$ into CNF at Step 3, and reverse Skolemization is used as
the generalizer at Step 4. By Theorem 5.1, each single conjunction $\neg C_i$ obtained
in this way is a minimal explanation of $E$. Then, the disjunction $\neg CC(B, E)$
of every $\neg C_i$ is also an explanation. Such DNF explanations are used in AI
applications such as diagnosis [13] and computing circumscription.

Thus, abduction and induction are very similar if we allow arbitrary forms of
clausal theories as hypotheses. There are three differences between them. First,
the form of hypotheses in induction is CNF, while it is usually DNF in abduction.
Second, at least one of the clauses in $CC(B, E)$ is from $NewCarc(B, \neg E, \mathcal{P})$
in induction, while all clauses in $CC(B, E)$ must be in $NewCarc(B, \neg E, \mathcal{P})$ in
abduction. Third, reverse Skolemization is used as a generalizer in abduction,
while other generalizers can be used in induction. No other difference exists
between abduction and induction as far as their implementation is concerned
in the context of consequence-finding. The next example illustrates the similarity
between induction and abduction.

Example 5.1 Let

$B_4 = (dog(x) \wedge small(x) \supset pet(x))$,

$E_4 = pet(c)$

be the background theory and the example. Then,

$NewCarc(B_4, \neg E_4, \mathcal{L}) = \neg pet(c) \wedge (\neg dog(c) \vee \neg small(c))$.

Now, put $CC(B_4, E_4) = NewCarc(B_4, \neg E_4, \mathcal{L})$. Then,

$\neg CC(B_4, E_4) = pet(c) \vee (dog(c) \wedge small(c))$,

which is exactly the disjunction of the minimal abductive explanations. Converting
$\neg CC(B_4, E_4)$ into CNF, we have

$F_4 = (dog(c) \vee pet(c)) \wedge (small(c) \vee pet(c))$.

By applying anti-instantiation, we get the clausal theory

$H_4 = \{\, dog(x) \vee pet(x),\ small(x) \vee pet(x) \,\}$.

On the other hand, the next example shows the main difference between
abduction and induction, which is the second of the differences listed above. Namely,
induction often utilizes consequences of $B$ before adding $\neg E$ in the construction
of $CC(B, E)$. This operation is essential to associate observations with the
background theory in induction. Abduction does not need such consequences because
they are redundant in virtue of the minimality of explanations.

Example 5.2 Let us consider the background theory and the example:

$B_5 = white(swan1)$,  $E_5 = \neg black(swan1)$.

Then, $NewCarc(B_5, \neg E_5, \mathcal{L}) = \neg E_5 = black(swan1)$. Hence, $\neg black(swan1)$ is
the unique minimal abductive explanation of $E_5$. In induction, on the other
hand, let

$CC(B_5, E_5) = white(swan1) \wedge black(swan1)$,

in which the first conjunct is the clause of $B_5$. By anti-instantiating $F_5 =
\neg CC(B_5, E_5)$, we can learn the integrity constraint:

$\neg white(x) \vee \neg black(x)$.



6 Related Work 

CF-induction is obviously influenced by previous work on inverse entailment
(IE), which was initiated by Muggleton. The original IE allows Horn clauses
for $B$ and a single Horn clause for each of $H$ and $E$. Even in this setting, however,
the method based on $\bot(B, E)$ is incomplete for finding $H$ such that $B \wedge H \models E$.
Muggleton considers an enlarged bottom set to make IE complete, but
the revised method is unsound. Furukawa [7] also proposes a complete algorithm,
but it is relatively complex. Yamamoto [22] shows that a variant of SOL
resolution can be used to implement IE based on $\bot(B, E)$. However, he computes
positive and negative parts in $\bot(B, E)$ separately, where SOL resolution is used
only for computing positive literals. Muggleton and Bryant suggest the use of
Stickel’s PTTP for implementing theory completion using IE, which seems
inefficient since PTTP is not a consequence-finding procedure but a theorem prover.
Compared with these previous works, CF-induction proposed in this paper is
simple, yet sound and complete for finding hypotheses from full clausal theories.
Instead of the bottom clause, CF-induction uses the characteristic clauses, which
strictly include the literals in $\bot(B, E)$.

Yamamoto and Fronhofer were the first to extend IE to allow for full clausal
theories for $B$ and $E$, and introduce the residue hypothesis for ground instances
of $B \wedge \neg E$. Roughly speaking, a residue hypothesis corresponds to the
enumeration of all paths in the matrix in Bibel’s Connection method. By contrast,
CF-induction is realized by a resolution-based consequence-finding procedure,
which naturally extends most previous work on IE, and can easily handle
non-ground clauses. Compared with the procedure by Yamamoto and Fronhofer, a merit
of CF-induction is the existence of a production field $\mathcal{P}$, which can be used to
guide and restrict derivations of clauses by reflecting an inductive bias.

7 Concluding Remark 

In this paper, we put emphasis on the completeness of inverse entailment in full
clausal theories. To this end, we proposed CF-induction, which is sound and
complete for finding hypotheses from full clausal theories. CF-induction
performs induction via consequence-finding, which enables us to generate inductive
hypotheses in a logically principled way. CF-induction can be implemented with
existing systematic consequence-finding procedures such as SOL resolution
and SFK resolution [4]. We also clarified the similarity and difference between
abduction and induction in the context of consequence-finding.

There exist formalizations of induction other than explanatory induction in
the literature on ILP, such as learning from interpretations (or satisfiability) [3],
and descriptive induction. De Raedt [3] proposes a translation of learning
from interpretations into learning from entailment, but the method requires the
notion of negative examples. Lachiche discusses various forms of descriptive
induction, which can also be characterized by deduction from completed theories.
The precise relationships between these different formalisms and
consequence-finding need to be addressed in the future.

A Proof of Theorem 4.1

We prove the correctness of CF-induction by giving its soundness and
completeness. Let $B$, $E$ and $H$ be clausal theories. Section A.1 shows that for any $H$






derived by a CF-induction from $B$ and $E$, it holds that $B \wedge H \models E$ and $B \wedge H$
is consistent. Section A.2 shows the converse, that is, if $B \wedge H \models E$ and $B \wedge H$
is consistent then $H$ is derived by a CF-induction from $B$ and $E$.

In the following, we assume the language $\mathcal{L}_H$ for all hypotheses $H$. Usually,
$\mathcal{L}_H$ is given as the set of all clauses constructed from the first-order language,
but we can restrict the form of hypotheses by considering an inductive bias with
a subset of literals/predicates. The following proofs can be applied to the case
with any inductive bias. Then, the production field $\mathcal{P}$ is set to the complement
of $\mathcal{L}_H$. When $\mathcal{L}_H$ is the set of all clauses, $\mathcal{P}$ is given as $\mathcal{L}$. We also assume the
existence of a sound and complete generalizer at Step 4 of a CF-induction.



A.1 Soundness of CF-Induction



Let $H$ be a hypothesis obtained by a CF-induction from $B$ and $E$. Then, by the
definition of a CF-induction, there is a CNF formula $CC(B, E) = C_1 \wedge \cdots \wedge C_m$
such that [a] $H$ is obtained by applying a generalizer to the CNF representation of
$\neg CC(B, E)$; [b] every $C_i$ ($i = 1, \ldots, m$) is an instance of a clause from
$Carc(B \wedge \neg E, \mathcal{P})$; and [c] there is a $C_j$ ($1 \le j \le m$) that is an instance of a clause from
$NewCarc(B, \neg E, \mathcal{P})$. Then, by [b], for any $C_i$ ($i = 1, \ldots, m$), there is a clause
$D_i \in Carc(B \wedge \neg E, \mathcal{P})$ such that $B \wedge \neg E \models D_i$ and $D_i \models C_i$. Obviously, it holds
that $B \wedge \neg E \models C_i$. Also, by [c], it holds that $B \not\models C_j$. Hence,

$B \wedge \neg E \models C_1 \wedge \cdots \wedge C_m$ and $B \not\models C_1 \wedge \cdots \wedge C_m$.

Now, let

$F = \neg CC(B, E) = \neg C_1 \vee \cdots \vee \neg C_m$.

Then,

$B \wedge \neg E \models \neg F$ and $B \not\models \neg F$,

which are equivalent to

$B \wedge F \models E$ and $B \wedge F$ is consistent.

Finally, $H \models F$ holds by [a], which implies that $B \wedge H \models E$. The condition that
$B \wedge H$ is consistent is included in Step 4 of a CF-induction. □



A.2 Completeness of CF-Induction

Suppose that $B \wedge H \models E$ and $B \wedge H$ is consistent. Then, $B \wedge \neg E \models \neg H$ and
$B \not\models \neg H$. We first show that $\neg H$ is entailed by $Carc(B \wedge \neg E, \mathcal{P})$. Suppose not,
i.e., $Carc(B \wedge \neg E, \mathcal{P}) \not\models \neg H$. Then, $H$ is consistent with $Carc(B \wedge \neg E, \mathcal{P})$. Here,
by the definition of the characteristic clauses, $Carc(B \wedge \neg E, \mathcal{P}) \subseteq Th(B \wedge \neg E)$.
Also, $Carc(B \wedge \neg E, \mathcal{P})$ belongs to the production field $\mathcal{P}$. On the other hand,
$H$ belongs to $\mathcal{L}_H$ that is the complement of $\mathcal{P}$. Hence, $H$ does not belong to $\mathcal{P}$.
Then, $H$ is consistent with $B \wedge \neg E$. This contradicts the fact that $B \wedge \neg E \models \neg H$.
Therefore,

$Carc(B \wedge \neg E, \mathcal{P}) \models \neg H$.

Hence, $Carc(B \wedge \neg E, \mathcal{P}) \wedge H$ is unsatisfiable. Now, there are two ways to prove
the completeness of CF-induction.

[A] Using Herbrand’s theorem, there is a finite set $S$ of ground instances of
clauses from $Carc(B \wedge \neg E, \mathcal{P})$ such that $S \wedge H$ is unsatisfiable. This set $S$ can
actually be constructed by a CF-induction. In fact, we can set $S$ as $CC(B, E)$.
In other words, let us construct $CC(B, E) = C_1 \wedge \cdots \wedge C_m$ at Step 2 of a
CF-induction such that

(a) each $C_i$ ($i = 1, \ldots, m$) is a ground instance of a clause from $Carc(B \wedge \neg E, \mathcal{P})$,

(b) $C_1 \wedge \cdots \wedge C_m \wedge H$ is unsatisfiable.

Then, there is a $C_j$ ($1 \le j \le m$) that is a ground instance of a clause from
$NewCarc(B, \neg E, \mathcal{P})$ (for this, see the discussion below the equation (11) in
Section 4). Finally, at Steps 3 and 4, $H$ can be obtained by applying a generalizer
(including anti-instantiation) to the CNF representation of $\neg CC(B, E)$. □

[B] Using the compactness theorem, there is a finite subset $S$ of $Carc(B \wedge \neg E, \mathcal{P})$
such that $S \wedge H$ is unsatisfiable. In this case, $S$ can also be constructed at Step 2
of a CF-induction as $CC(B, E) = C_1 \wedge \cdots \wedge C_m$, where

(a) every $C_i$ ($i = 1, \ldots, m$) is a variant of a clause from $Carc(B \wedge \neg E, \mathcal{P})$, and

(b) $C_1 \wedge \cdots \wedge C_m \wedge H$ is unsatisfiable.

Then, there is a $C_j$ ($1 \le j \le m$) that is a variant of a clause in $NewCarc(B, \neg E, \mathcal{P})$
as in the proof of [A]. In this case, however, we have to take care of variables in
the $C_i$’s. Taking the complement of a $C_i$, each variable $x$ in $C_i$ becomes a Skolem
constant $sk_x$ in $\neg C_i$, in which $x$ is interpreted as existentially quantified.
Sometimes we need multiple “copies” of $\neg C_i$ in $\neg CC(B, E)$ using different constants
like $sk_1$, etc., depending on how many times $C_i$ is used to derive $\neg H$ from
$B \wedge \neg E$. Then, at Steps 3 and 4, $H$ can be obtained by applying a generalizer
to the CNF representation of $\neg CC(B, E)$. □

Example A.1 We now verify the completeness proof [B] by applying it to the
theory in Example 4.2, while the proof [A] can easily be checked by following
the way shown in Example 4.2. This time, we choose a non-ground characteristic
clause as

$CC'(B_3, E_3) = even(0) \wedge (\neg odd(x) \vee even(s(x))) \wedge \neg odd(s(s(s(0))))$.

Then, the complement of $CC'(B_3, E_3)$ becomes

$(\neg even(0) \vee odd(sk_x) \vee odd(s(s(s(0)))))$
$\wedge\ (\neg even(0) \vee \neg even(s(sk_x)) \vee odd(s(s(s(0)))))$.

Here, we need only one copy of $\exists x\,(odd(x) \wedge \neg even(s(x)))$. The hypothesis $H_3 =
\neg even(x) \vee odd(s(x))$ entails $\neg CC'(B_3, E_3)$ because $H_3$ entails

$\exists y\,[(\neg even(0) \vee odd(y)) \wedge (\neg even(s(y)) \vee odd(s(s(s(0)))))]$.

To see this, take the substitution $y/s(0)$ in the above formula. Then, $H_3$
subsumes both $\neg even(0) \vee odd(s(0))$ and $\neg even(s(s(0))) \vee odd(s(s(s(0))))$, and thus
entails $\neg CC'(B_3, E_3)$.


Abstract. Analysing the use of a Unix command shell is one of the
classic applications in the domain of adaptive user interfaces and user
modelling. Instead of trying to predict the next command from a history
of commands, we automatically produce scripts that automate frequent
tasks. For this we use an ILP association rule learner. We show how to
speed up the learning task by dividing it into smaller tasks, and the need
for a preprocessing phase to detect frequent subsequences in the data.
We illustrate this with experiments with real-world data.

Keywords: sequences, adaptive user interface 



1 By Your Command 

For many years now, the Unix command shell has been used by experienced and
inexperienced users to interact with their system. Although this shell has existed for many
years, it can still be considered a flexible and customisable user interface: aliases
allow you to give easier names to certain commands, with scripts you can build
new commands by combining existing commands, et cetera. Many users do not
fully utilise the power of the shell, either because they do not know all the tools
the shell can provide or because they do not want to go through the effort of using the
tools.

Since the shell is such a complex and powerful user interface, many people
have investigated how users use this shell. Greenberg collected logs from
168 different users, classified into four categories: computer scientists, experienced
programmers, novice programmers and users performing no programming tasks
(such as people from the administration et cetera). He used standard
statistical techniques to analyse, e.g., which history mechanism is most useful.

^ (Footnote to “the Unix command shell”.) Although there exist a large number of
different Unix shells, the differences between these shells are not relevant for this
paper. Therefore we will talk about ‘the’ Unix command shell.

Later on, people started using machine learning techniques to automatically
analyse logs of shell use. One way to help people using the shell is by predicting
the next command they will type, given the history of previously typed commands.
These predictions (most techniques present a list of the n most likely commands,
with n small) are presented to the user, who can accept the prediction or type
another command. A simple but useful prediction technique is building a
probability table which stores, for each pair of commands A and B, the probability that the
next command will be B given that the last command was A. Many variations
on this technique have been investigated: using the n last commands instead of
just the last one, or using a decay factor in updating the probabilities to make
the most recent information more important than older observations. Other
approaches for predicting the next command use decision trees or combine
different techniques.
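The probability table described above can be sketched in a few lines of Python. This is only an illustration of the basic first-order variant (conditioning on the last command, without a decay factor); the function names and the sample history are assumptions made here and do not come from any of the cited systems.

```python
from collections import Counter, defaultdict


def build_table(history):
    """Count, for every command A, how often each command B directly follows it."""
    counts = defaultdict(Counter)
    for previous, following in zip(history, history[1:]):
        counts[previous][following] += 1
    return counts


def predict(counts, last_command, n=3):
    """Return up to n (command, estimated probability) pairs, most likely first."""
    following = counts.get(last_command)
    if not following:
        return []
    total = sum(following.values())
    return [(cmd, k / total) for cmd, k in following.most_common(n)]


# Hypothetical command history, oldest command first.
history = ["cd", "ls", "emacs", "make", "ls", "emacs", "make", "ls", "cd"]
table = build_table(history)
suggestions = predict(table, "ls")
print(suggestions)
```

In this history `ls` is followed twice by `emacs` and once by `cd`, so `emacs` is suggested first with an estimated probability of 2/3.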

These techniques predict up to 75% of the commands correctly, and about 
50% when also the extra parameters of the command (switches, filenames et 
cetera) are predicted. The accuracy is even higher when multiple predictions are 
shown. Although this looks promising, there are some drawbacks: 

— the user often uses short commands. For those commands, it is as little
effort to type the command as to select or verify a prediction, so nothing is
gained.

— long commands are difficult to predict correctly. Verifying and correcting the 
prediction can take as much time as typing the command (or even more). 

— a special shell is necessary to integrate the results of the prediction, but users 
often want to keep using the shell they are used to. 

As a result, these techniques may be very useful to help people who have
difficulties with typing (physically impaired or while performing other tasks), but for
‘normal’ shell users, although it may result in fewer keystrokes, the time gained
will probably be low.



2 Automating the Automation 

Another way of helping shell users is by suggesting scripts that automate frequently
performed tasks. The advantages over the previous approach are:

— the user has to verify the proposed script only once, and can then use it as
often as he likes without the need to verify or edit the script.

— even a script that consists of only short commands can still reduce the user’s
effort.

— a system that suggests these scripts can be independent from the shell, so 
that the user can keep on using his familiar shell. 

— it is easy to exchange such proposed scripts with other users. 

— users can become more aware of their own behaviour. 

This approach also has some drawbacks: it probably does not save as many
keystrokes as the first approach, and it takes a large history of typed commands to
extract meaningful scripts. Both approaches can be combined, which eliminates
some of the drawbacks of the scripting approach, but introduces some drawbacks
of the next-command prediction approach.

To the best of our knowledge, only very few implementations of this scripting
approach exist. Graph-based induction has been used to analyse which files
are used as input or are produced as output by Unix commands. In this way,
sequences of commands that operate on one file (or files produced from this file)
are detected and transformed into scripts. This allows that approach to detect, e.g.,
edit-compile-run sequences, but it is not able to detect sequences of commands
that are not related by common file use such as lpr -P<printer> <file> ;
lpq -P<printer> or sequences of commands that are not related by any resource
at all such as starting a set of common applications right after logging in.

Macro operators [7,8] are also related to the approach we presented above.
When solving problems one starts from a start state and tries to get to a solution
state by applying operators. A macro operator is a sequence of such operators,
and the goal is to find a set of macro operators such that solving the task
with the basic operators and the macro operators takes less time than with
the basic operators only. This relates to our approach: we look for sequences of
actions, and we face the same length consideration: longer sequences solve bigger
tasks, but can be used less often. However, there are some major differences
between these macro operators and the shell scripts: a script can contain gaps
(see Section 3) and, more importantly, the actions in scripts have parameters,
which makes this a relational problem.

Why can this be seen as a relational learning problem? Commands are
interrelated by their execution order (or time), and each command is possibly related
to one or more parameters. This allows us to represent a shell log as a set of
logical ground atoms. We translate these as follows:

— commands are translated to stub/3 predicates, where the first argument is
the order of execution (which is also a unique identifier), the second argument
is the time of execution (expressed as the number of seconds past a certain
fixed moment in time) and the third is the command itself.

— parameters are translated to parameter/3 predicates, where the first
argument is the identifier of the command, the second argument is the order of
the parameter within that command, and the third is the parameter itself.

For example, the shell log 

cp /etc/skel/.bashrc ~/.mybashrc 
emacs ~/.mybashrc 

would be translated to 

stub(1, 987882636, 'cp').
parameter(1, 1, '/etc/skel/.bashrc').
parameter(1, 2, '~/.mybashrc').
stub(2, 987882639, 'emacs').
parameter(2, 1, '~/.mybashrc').
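
As a rough illustration of this translation, the following Python sketch turns a list of (timestamp, command line) pairs into stub/3 and parameter/3 facts. The function name log_to_facts and the input format are assumptions made for this example; the text does not say how the logs were parsed, and quoting of special characters inside atoms is ignored here.

```python
import shlex


def log_to_facts(entries):
    """Translate (timestamp, command line) pairs, given in execution order,
    into stub/3 and parameter/3 facts written as strings."""
    facts = []
    for ident, (timestamp, line) in enumerate(entries, start=1):
        # The first word is the command stub, the rest are its parameters.
        stub, *params = shlex.split(line)
        facts.append(f"stub({ident}, {timestamp}, '{stub}').")
        for position, param in enumerate(params, start=1):
            facts.append(f"parameter({ident}, {position}, '{param}').")
    return facts


facts = log_to_facts([
    (987882636, "cp /etc/skel/.bashrc ~/.mybashrc"),
    (987882639, "emacs ~/.mybashrc"),
])
print("\n".join(facts))
```

Using the execution order as identifier keeps the two predicates joined on their first argument, as in the example above.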

Notice that this representation has difficulties representing piped commands.
Either such commands must be represented as two separate commands, where
the first one writes to an imaginary named pipe and the second reads from this
named pipe, or as one command that has the second command as a parameter.
We could also introduce a new predicate to handle this, but this would increase
the search space.

We can also represent a parameterised script as a logical conjunction. For
instance, a script for copying a file and then starting to edit that copy could be
represented as

stub(A,_,'cp'), parameter(A,1,F1), parameter(A,2,F2),
B is A + 1, stub(B,_,'emacs'), parameter(B,1,F2).
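
To make the meaning of such a conjunction concrete, here is a minimal Python sketch that evaluates this one pattern against a log. The dictionary layout (stubs keyed by identifier, parameters keyed by identifier and position) and the function name are assumptions for illustration only; an ILP system would evaluate the query by resolution against the facts rather than with hand-written code.

```python
def find_copy_then_edit(stubs, params):
    """Return (A, F1, F2) for every cp at position A that is directly followed
    by an emacs command whose first parameter is the copy target F2."""
    matches = []
    for a, command in stubs.items():
        b = a + 1                      # B is A + 1
        if command != "cp" or stubs.get(b) != "emacs":
            continue
        f1, f2 = params.get((a, 1)), params.get((a, 2))
        # The shared variable F2 forces both commands to name the same file.
        if f1 is not None and f2 is not None and params.get((b, 1)) == f2:
            matches.append((a, f1, f2))
    return matches


stubs = {1: "cp", 2: "emacs"}
params = {
    (1, 1): "/etc/skel/.bashrc",
    (1, 2): "~/.mybashrc",
    (2, 1): "~/.mybashrc",
}
print(find_copy_then_edit(stubs, params))
```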

In the next section we present how inductive logic programming can be used
to detect such scripts and the problems we faced when applying it to real
world data. In section 4 we evaluate the resulting system by analysing shell logs
and we conclude in section 5.

3 Generating Scripts with Warmr 

Warmr [3] is an upgrade of the propositional Apriori algorithm to an algo- 
rithm that can detect association rules expressed in a subset of first order logic. 
Warmr does this in two phases: first it searches for all frequent patterns (i.e. 
legal conjunctions of literals as defined in a language) in the dataset, and in 
a second phase it combines these frequent patterns into association rules. This 
first phase can more formally be expressed as follows: 

Definition 1. Given a set of ground atoms (examples) $E$, a set of datalog clauses
(background knowledge) $B$, a language $\mathcal{L}$ defining all legal patterns and $m \ge 1$
(minimal occurrence), find all $l \in \mathcal{L}$ such that the query $l$ succeeds for at least $m$
examples in $E \cup B$.

The scripts we are looking for can be found with this first phase of the
Warmr algorithm. The examples consist of the logs represented as facts as
discussed in the previous section; the language allows the stub/3 and parameter/3
predicates (where the arguments can be constants as well as variables). An
additional predicate is added that allows two commands to link to each other. This
predicate specifies when two commands are considered next to each other in a
sequence. This can be that one command B must be executed right after
command A (such as in the example above), but it can also be that B is executed at
most n commands after A or at most n time units after A. Because we can also
provide background knowledge to the system, we can e.g. tell the system that
ls -a and ls --all are exactly the same command or tell the system that it should
not distinguish between different editor commands. We can also use background
knowledge to split up filenames (provided as parameters) into directory, filename
root and filename extension.
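
The three readings of "next to each other" and the filename background knowledge can be sketched in Python as follows. All function names and the dictionary representation of a command are hypothetical; in the system described here these would be Prolog predicates supplied as background knowledge.

```python
import posixpath


def next_directly(a, b):
    """B is executed right after A."""
    return b["id"] == a["id"] + 1


def next_within_commands(a, b, n):
    """B is executed at most n commands after A."""
    return 0 < b["id"] - a["id"] <= n


def next_within_time(a, b, n):
    """B is executed after A and at most n time units later."""
    return b["id"] > a["id"] and b["time"] - a["time"] <= n


def split_filename(path):
    """Split a filename parameter into directory, root and extension."""
    directory, name = posixpath.split(path)
    root, ext = posixpath.splitext(name)
    return directory, root, ext.lstrip(".")


a = {"id": 1, "time": 100}
b = {"id": 3, "time": 104}
print(next_directly(a, b), next_within_commands(a, b, 2), next_within_time(a, b, 5))
print(split_filename("~/thesis/paper.tex"))
```

Choosing a looser linking predicate enlarges the set of candidate sequences, which is one reason the search becomes more expensive once gaps are allowed.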

In this way we can use Warmr to find the same sequences as those that
could be found using the graph induction algorithm, but it has some advantages
over the graph induction algorithm:

— additional information can be incorporated via the use of background
knowledge (aliases, hierarchies, ...),

— any relation between parameters of any command in the sequence can be
expressed or learned, not just equality between parameters of two consecutive
commands.

The major drawback, however, is the execution time of this algorithm. For a
small shell log of 500 commands, this algorithm needs more than 10 hours^1 of
CPU time to find all sequences that occur at least 5 times. For larger logs the
algorithm was even unable to finish due to lack of resources. However, we can
speed up this algorithm by splitting up the learning task and by using a better
algorithm for one of the learning tasks.



3.1 Speedup by Splitting Up the Learning Task 

Warmr builds its frequent patterns levelwise: the list of frequent patterns of level
1 contains all frequent atoms in $E$. In each run, each pattern $P$ in the frequent set
is extended to a pattern $P \wedge H$ for each atom $H \in E$ if there exists no infrequent
pattern $Q$ that is more general than $P \wedge H$. This means that Warmr will build
patterns that include parameter/3 atoms for commands in incomplete frequent
sequences or commands that will never belong to a frequent sequence. This
problem can be solved by first constructing all frequent sequences of commands
without taking the parameter information into account and then transforming each
of these sequences into a separate pattern discovery task to find the frequent
parameter patterns. Usually the user is not interested in a script for each frequent
pattern, but only in frequent patterns that are not part of a longer frequent
pattern, so we filter out the non-maximal patterns before we start searching for
the frequent parameter patterns.

This split dramatically reduces the size of the search space. Finding the
frequent parameter patterns for a given frequent command pattern takes only
a few seconds because frequent command patterns contain only a few commands
(compared with the total set of commands) and so the search space is small.
Most of the time still goes into finding frequent command patterns. But since
this is a propositional task, more efficient algorithms can be found.



3.2 Speedup by Using the Minimal Occurrence Algorithm 

The way in which Warmr extends a frequent pattern is not optimised for finding
frequent sequences in which the order of the items is important. We illustrate
this with an example. Suppose that our dataset contains the frequent sequence
abcd. When Warmr has reached its second level, it has constructed these
frequent patterns:

a -> b
b -> c
c -> d

^1 All timings in this paper are on a Pentium III 800 MHz computer with 256 MB
memory, running Linux and the Ace version of Warmr running ilProlog [1].

However, when Warmr extends the first pattern a -> b, it will extend it to

a -> b -> a
a -> b -> b
a -> b -> c
a -> b -> d

and test each of these patterns on the dataset, to see that only the third succeeds.
However, since we are looking for subsequent commands, we know that only
a -> b -> c can be a valid extension. Moreover, if we know the identifiers at which
the sequences a -> b and b -> c start, we can calculate from this the number
of occurrences of a -> b -> c (together with the identifiers of a on which these
sequences start) without looking into the dataset.
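
The join just described can be sketched in a few lines of Python for the simplest case of gap size 0. The function names and the ten-command toy log are assumptions for illustration; starts_of scans the log only to obtain the level-two start lists and to cross-check the joined result.

```python
def starts_of(log, pattern):
    """1-based start identifiers of contiguous occurrences of pattern in log."""
    k = len(pattern)
    return [i + 1 for i in range(len(log) - k + 1) if log[i:i + k] == pattern]


def join_occurrences(starts_left, starts_right):
    """Start identifiers of x -> y -> z, computed only from the start
    identifiers of x -> y and y -> z (gap size 0): an occurrence of x -> y
    at p extends to x -> y -> z exactly when y -> z starts at p + 1."""
    right = set(starts_right)
    return sorted(p for p in starts_left if p + 1 in right)


log = list("abcdabcabd")
ab = starts_of(log, list("ab"))
bc = starts_of(log, list("bc"))
abc = join_occurrences(ab, bc)
print(ab, bc, abc)
```

The length of the joined list is the frequency of the longer sequence, so no further pass over the dataset is needed.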

The above idea is implemented in the minimal occurrence algorithm m- 
We use an algorithm based on this minimal occurrence algorithm to find all 
maximal frequent sequences, where we add identifiers for left and right parents 
to efficiently combine elements into longer sequences. 

Definition 2. Let $s$ be a sequence of $n$ elements: $e_{s,1} e_{s,2} e_{s,3} \ldots e_{s,n-1} e_{s,n}$.
A sequence $s'$ ($e_{s',1} \ldots e_{s',p}$) of length $p$ is a subsequence of the sequence $s$
of length $n$ if

$1 \le p \le n$

and $\forall i,\ 1 \le i \le p,\ \exists j,\ 1 \le j \le n : e_{s',i} = e_{s,j}$

and $\forall i,j,\ 1 \le i < j \le p,\ \exists k,l,\ 1 \le k < l \le n : e_{s',i} = e_{s,k}$ and $e_{s',j} = e_{s,l}$.

The frequency of a subsequence in a sequence is the number of different mappings
from elements of $s'$ into the elements of $s$ such that the previous conditions hold.

Notice that this is a general definition; usually we will restrict the sequence by
introducing a maximal gap size in the third requirement:

$\forall i,\ 1 \le i < p,\ \exists k,l,\ 1 \le k < l \le n,\ l - k \le gapsize + 1 : e_{s',i} = e_{s,k}$ and $e_{s',i+1} = e_{s,l}$
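
One way to read the frequency in Definition 2 is as the number of order-preserving mappings of the subsequence into the sequence, with consecutive mapped positions at most gapsize + 1 apart when a maximal gap size is set. The following Python sketch counts these mappings with a small dynamic programme. It is an illustration under that reading, not the implementation used in the experiments, and the name count_occurrences is hypothetical.

```python
def count_occurrences(s, sub, gapsize=None):
    """Number of order-preserving mappings of sub into s. With a gap size,
    consecutive mapped positions k < l must satisfy l - k <= gapsize + 1."""
    n = len(s)
    # ways[k]: number of mappings of the prefix handled so far ending at k.
    ways = [1 if s[k] == sub[0] else 0 for k in range(n)]
    for element in sub[1:]:
        extended = [0] * n
        for l in range(n):
            if s[l] != element:
                continue
            lowest = 0 if gapsize is None else max(0, l - gapsize - 1)
            extended[l] = sum(ways[lowest:l])
        ways = extended
    return sum(ways)


print(count_occurrences("abcab", "ab"))             # unrestricted
print(count_occurrences("abcab", "ab", gapsize=0))  # adjacent elements only
```

In the string abcab the pair ab maps to positions (1,2), (1,5) and (4,5) without a gap restriction, and only to (1,2) and (4,5) with gap size 0.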

Definition 3. A sequence s' is a maximal subsequence of frequency f of a sequence
s if s' is a subsequence of the sequence s with a frequency of at least f and
there is no sequence s'', subsequence of s with frequency at least f, such that s'
is a subsequence of s''.

Each sequence s of length n > 1 has 2 frequent subsequences of length n - 1.
We call the first subsequence leftparent(s) and the second rightparent(s). To
extend a sequence s of length 2 or longer, one just combines this sequence with
a sequence t where the right parent of s is the left parent of t. However, for
sequences p and q of length one, there are no restrictions in combining them,
except that p should occur before q (and the maximal gap size, if defined). Because
of this, for constructing the second level, we can do no better than just trying all
combinations of 2 sequences from level one. The algorithm is shown in figure 1.
We give each sequence a unique identifier and we store and index the identifiers
of the left and right parents for each sequence. This allows us to retrieve possible
extensions for a frequent subsequence very efficiently.



length = 2
let S(1) be the set of all frequent subsequences of s of length 1
Output = ∅
foreach t, u ∈ S(1)
    if e_{t,1} e_{u,1} is frequent in s w.r.t. maximal gap size
        add e_{t,1} e_{u,1} to S(2)
while S(length) ≠ ∅ do
    S(length+1) = ∅
    foreach t ∈ S(length)
        foreach u ∈ S(length) such that leftparent(u) = rightparent(t)
            if e_{t,1} e_{t,2} ... e_{t,length} e_{u,length} is frequent
                add it to S(length+1)
    foreach t ∈ S(length)
        if t is not part of any element in S(length+1)
            add t to Output
    length = length + 1
return Output

Fig. 1. Calculating all frequent maximal subsequences of s
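
The loop of Fig. 1 can be sketched in Python as below. Several simplifications are assumptions of this sketch and not part of the algorithm as described: sequences are plain tuples instead of indexed identifiers, frequency is recomputed on the input with a helper count_occurrences instead of being derived from stored occurrence lists, and "part of" is tested against the left and right parents only. As in Fig. 1, frequent sequences of length 1 are never written to the output.

```python
from itertools import product


def count_occurrences(s, sub, gapsize):
    """Number of mappings of sub into s with position distance <= gapsize + 1."""
    n = len(s)
    ways = [1 if s[k] == sub[0] else 0 for k in range(n)]
    for element in sub[1:]:
        extended = [0] * n
        for l in range(n):
            if s[l] == element:
                extended[l] = sum(ways[max(0, l - gapsize - 1):l])
        ways = extended
    return sum(ways)


def maximal_frequent_subsequences(s, min_count, gapsize=0):
    def frequent(candidate):
        return count_occurrences(s, candidate, gapsize) >= min_count

    singles = {(e,) for e in set(s) if frequent((e,))}            # S(1)
    # Level 2: nothing better than trying all ordered pairs from level 1.
    level = {t + u for t, u in product(singles, repeat=2) if frequent(t + u)}
    output = []
    while level:
        longer = set()
        for t in level:
            for u in level:
                if u[:-1] == t[1:]:        # leftparent(u) = rightparent(t)
                    candidate = t + u[-1:]
                    if frequent(candidate):
                        longer.add(candidate)
        for t in level:
            # t is maximal if it is not a parent of any longer frequent sequence.
            if not any(c[:-1] == t or c[1:] == t for c in longer):
                output.append(t)
        level = longer
    return sorted(output)


print(maximal_frequent_subsequences(list("abcdxabcdyabcd"), min_count=3))
```

On this toy input only the sequence a, b, c, d survives, because every shorter frequent sequence is a parent of a longer frequent one.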



This algorithm is also related to the cSPADES algorithm m which can 
search for sequences in large databases in an efficient way. However, we combine 
elements based on their left and right parent instead of combining elements with 
identical tails because this last approach results in a less efficient search when 
using maximal gap size constraints as we do. 

3.3 Gluing Everything Together 

How do we use all this information to transform a shell log into a set of shell
scripts? First we remove from the logs all commands that are useless in scripts
because they cannot change the state of the computer: ls, more, date, et cetera.
Of course, when the aim of the analysis is to gain insight into the behaviour of
the user instead of producing scripts, this step can be skipped.

Next we use our variant of the minimal occurrences algorithm on the altered
log file. By varying the maximal gap size and the minimal frequency for a
subsequence to be considered frequent one can control the number of frequent
sequences. We also add to the algorithm the possibility to define constraints
on the subsequences. These constraints allow us, for example, to reject sequences
that use a command more than n times. We output the frequent subsequences
as a set of freqep/4 predicates, where the first argument is the identifier of the
subsequence, the second is the identifier of the mapping (the highest identifier
for a given subsequence is the frequency of that subsequence), the third argument
is the element identifier within the subsequence and the fourth argument is
the element identifier within the input sequence. We use this representation
because it is independent of the ILP program that is going to use this subsequence
information.

freqep(1,1,1,28). freqep(1,1,2,30).
freqep(1,2,1,72). freqep(1,2,2,73).

This could be a result of this algorithm, specifying that it found only
one frequent subsequence, which occurred twice: element number 28 in the input
sequence is the first element in the first occurrence of the subsequence, element
30 is the second (so there is a gap of size 1). The second occurrence of the
sequence is at positions 72 and 73.
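
The mapping from occurrences to freqep/4 facts is mechanical, as the following Python sketch shows. The input layout (a dictionary from subsequence identifier to a list of occurrences, each a list of input positions) and the function name are assumptions made for this example.

```python
def freqep_facts(subsequences):
    """Emit one freqep/4 fact per element of every occurrence: subsequence id,
    occurrence id, element index in the subsequence, position in the input."""
    facts = []
    for seq_id, occurrences in sorted(subsequences.items()):
        for occ_id, positions in enumerate(occurrences, start=1):
            for elem_id, position in enumerate(positions, start=1):
                facts.append(f"freqep({seq_id},{occ_id},{elem_id},{position}).")
    return facts


facts = freqep_facts({1: [[28, 30], [72, 73]]})
print("\n".join(facts))
```

Because occurrences are numbered from 1, the highest occurrence identifier of a subsequence equals its frequency, as stated above.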

In the next step we add the subsequence output to the log data that we
transform to the representation discussed in section 2. We do not allow stub/3
atoms to be used in the language (Warmr would again start looking for frequent
sequences this way) but force the use of freqep/4 instead. The parameter/3
and other predicates defined in background knowledge are also added to the
language. We then let Warmr search for frequent patterns. An example of such
a pattern is:

freqep(1,Occ,1,Pos1), parameter(Pos1,1,Param1), base(Param1,Base1),
ext(Param1,'tex'), freqep(1,Occ,2,Pos2), parameter(Pos2,1,Param2),
ext(Param2,'dvi'), base(Param2,Base1).

Since Warmr outputs all frequent patterns, we first have to make sure we
remove all redundant rules. For instance, together with the above rule, Warmr would
also produce the following rules:

freqep(1,Occ,1,Pos1), parameter(Pos1,1,Param1), base(Param1,Base1),
ext(Param1,Ext1), freqep(1,Occ,2,Pos2), parameter(Pos2,1,Param2),
ext(Param2,Ext2), base(Param2,Base1).

freqep(1,Occ,1,Pos1), parameter(Pos1,1,Param1), base(Param1,Base1),
ext(Param1,'tex'), freqep(1,Occ,2,Pos2), parameter(Pos2,1,Param2),
ext(Param2,'dvi').

freqep(1,Occ,1,Pos1), parameter(Pos1,1,Param1), base(Param1,Base1),
ext(Param1,'tex'), freqep(1,Occ,2,Pos2), parameter(Pos2,1,Param2).

et cetera.

Next the most specific frequent rules are translated to scripts: the commands
are looked up in the input sequence and the remainder of each rule is parsed and
written as a Unix shell function. Our first rule would be translated to:

function giveyourownname() {
    latex $1.tex
    xdvi $1.dvi
}

Now we sort the resulting scripts. Although length and frequency are important
attributes of a script, we sort the scripts according to the frequency divided
by the expected frequency of this subsequence in a sequence where the elements
are put in random order. The higher this number, the more 'exceptional' the
sequence is.
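
The text does not give the formula for the expected frequency. The Python sketch below therefore uses one plausible approximation, stated here as an assumption: in a randomly ordered log of N commands, a contiguous sequence of k commands is expected to occur about (N - k + 1) times the product of the relative frequencies of its commands. The function names are hypothetical.

```python
from collections import Counter
from math import prod


def expected_frequency(log, sequence):
    """Approximate expected number of contiguous occurrences of sequence in a
    random ordering of log (independence approximation, an assumption)."""
    n, k = len(log), len(sequence)
    counts = Counter(log)
    return (n - k + 1) * prod(counts[command] / n for command in sequence)


def exceptionality(log, sequence, observed):
    """Sorting key: observed frequency divided by expected frequency."""
    return observed / expected_frequency(log, sequence)


log = ["latex", "xdvi"] * 5 + ["ls"] * 10
print(expected_frequency(log, ["latex", "xdvi"]))
print(exceptionality(log, ["latex", "xdvi"], observed=5))
```

In this toy log the pair latex, xdvi occurs 5 times against an expectation of roughly 1.2, so it would rank high even though both commands are individually rare.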

4 Experiments 

We compare these approaches on the Greenberg dataset 0. Table 1 shows the
timings of finding frequent subsequences for 5 user logs from the Greenberg
dataset with the default minimal frequency of 0.01. We see that, as the number
of commands grows, the execution time of both algorithms (expressed in seconds)
increases. The number of maximal frequent subsequences found does not seem
to influence the execution time. The minimal occurrences algorithm is about 7
times faster than Warmr.



Table 1. Timing results with maximal gap size 0

file     #commands  #frequent seq.  max. length seq.  time Warmr (s)  time minocc (s)
non-5          216               4                 6             1.2              0.5
non-16         590              26                 6            13.8              1.8
non-23         962              21                 3            21.7              2.0
non-11        1537              13                 4            30.4              4.5
non-4         3608              16                 4            83.7             13.1



However, when we set the maximal gap size to 1, this changes the timings
dramatically (see table 2). It takes Warmr almost 200 times as long as with
gap size 0 to find the sequences for non-5. For non-16 the slowdown is a factor
of more than 700. For larger datasets, Warmr was not able to compute all sequences
due to lack of resources. The minimal occurrences algorithm does not suffer from
this problem: its execution time does not even double.



Table 2. Timing results with maximal gap size 1

file     #frequent seq.  max. length seq.  time Warmr (s)  time minocc (s)
non-5                 8                 6             236              0.7
non-16               70                 9            9975              2.6
non-23               31                 6               -              2.4
non-11               63                 8               -              6.9
non-4                39                 6               -             18.2



When we raise the maximal gap size to 2 (see table 3), Warmr is not able
to find all frequent subsequences for even the smallest log file. While the other
algorithm finishes in about a minute or less, we see that not only the length
of the input but also the number of frequent subsequences found influences the
execution time.



Table 3. Timing results with maximal gap size 2

file     #frequent seq.  max. length seq.  time minocc (s)
non-5                35                 9              1.6
non-16              607                13             27.1
non-23               99                 9              5.6
non-11              401                11             40.6
non-4               173                10             63.6



Just for illustration purposes we show some of the scripts found:

lpr -P $1 $2        cd ..           mkdir $1
lpq -P $1           rmdir $1        cd $1

5 Conclusion 

In this paper we tackled the problem of creating shell scripts (a sequence of
commands together with their (variable) parameters) from shell logs. Since this task
can be formulated as a relational pattern discovery task, we used the Warmr
algorithm. To speed up this task we separate the propositional subtask (detecting
frequent subsequences of command stubs) from the relational task (detecting
frequent patterns in the parameters of a frequent subsequence of commands). We
also presented a version of the minimal occurrences algorithm to find frequent
subsequences in an efficient way.

We compared Warmr with this new system on real world data. Warmr is
considerably slower, and often fails to find all sequences due to lack of computing
resources. These experiments also show that shell logs do contain frequent
subsequences, and even frequent subsequences of considerable length.

Other possible applications for this technique are other sequence analysis
tasks, such as analysing traces of visits to a website where each click is annotated
with extra information. This can help the designer in understanding how users
visit the site or allow for automatic site modification. The system can also be
used in analysing other user interfaces: a prototype of the user interface is built
and testers try this interface. All their annotated actions are logged and analysed.
If important long subsequences of actions are found, the developer can consider
redesigning the interface in such a way that these actions can be performed
more easily. All applications where sequential annotated data is involved and where
frequent subsequences provide important knowledge are potential application
domains for this technique.

Applications in which detecting sequences is not the main task can also make
use of efficient subsequence detection algorithms. Analysis of molecular data is
such an example: sequences of atoms can be important features in such
applications.
Abstract. The identification of evolutionarily related (homologous) proteins
is a key problem in molecular biology. Here we present an inductive
logic programming based method, Homology Induction (HI), which acts
as a filter for existing sequence similarity searches to improve their
performance in the detection of remote protein homologies. HI performs a
PSI-BLAST search to generate positive, negative, and uncertain examples,
and collects descriptions of these examples. It then learns rules to
discriminate the positive and negative examples. The rules are used to
filter the uncertain examples in the “twilight zone”. HI uses a multi-table
database of 51,430,710 pre-fabricated facts from a variety of biological
sources, and the inductive logic programming system Aleph to
induce rules. HI was tested on an independent set of protein sequences
with at most 40 per cent sequence similarity (PDB40D). ROC
analysis is performed showing that HI can significantly improve existing
similarity searches. The method is automated and can be used via a
web/mail interface.



1 Introduction 

The identification of evolutionarily related (homologous) proteins is a key problem
in molecular biology. Knowledge of a homologous relationship between two
proteins, one of known function and the other of unknown function, allows the
probabilistic inference that the proteins have the same function (as evolution
generally conserves function). Such inferences are the basis of most of our knowledge
of sequenced genomes.

1.1 Homology Searches 

Protein homology is usually inferred by using computer programs to measure the
similarity of two or more proteins. This is almost always done by comparing the
two amino-acid strings of the proteins under consideration, and measuring the
character-wise similarity between them. However, there is generally much more
information available which is ignored.

Initially, sequence homology searches were done using dynamic programming
I22E3. Due to the rapid growth of the sequence databases, dynamic programming
became too time consuming and more efficient heuristic approaches were
developed. The most common programs are FASTA [28,19], BLAST [1] and
PSI-BLAST mE\, as well as hidden Markov-model approaches such as SAM-T98
[13]. Such programs consume more CPU time than all other bioinformatic programs
put together. Although these algorithms perform well for closely related
homologous sequences, the results for more distantly related proteins are less
reliable [2^, detecting only about 50% of all possible homologies.

Here, we describe a procedure using background knowledge together with the 
protein’s amino acid sequence to induce homology. The basic idea is to collect 
as much information as possible for a protein and its likely homologous proteins, 
and then to infer homology using discriminatory inductive logic programming 
(ILP). We call this approach Homology Induction (HI). 

Work related to HI includes BLAST PRINTS p2] and PrePRINTS BLAST Search
jHHI j. PrePRINTS is a tool to help biologists filter BLAST results by “decorating”
the output of BLAST with keywords. A score is given based on a statistical analysis
of the frequency of keyword pairs. A similar approach is taken by SAWTED
(structure assignments with text descriptions) m, retrieving remote homologues
using annotations in the SWISS-PROT database based on a text-similarity measure.
A different approach was used by Jaakkola et al., employing the Fisher kernel
method on top of an HMM as a discriminative method for detecting remote
homologues m- HI is distinguished from these approaches by its ability to use
all available background knowledge, its more general learning ability, and by its
more comprehensive experimental validation.



2 Methods 

2.1 Homology Induction 

The HI approach is based on the following steps: (a) collection of possible
homologous proteins using an existing method of sequence similarity search (SSS);
(b) accumulation of all available information for these proteins; (c) division of
the possible homologues into closely related homologues (training set) and a set of
more remote homologues (twilight zone) and generation of negative examples;
(d) induction of rules for the training set; and (e) application of these rules to
the set of possible homologues in the twilight zone. The individual steps are
explained in the following subsections.

a) Similarity Search. A similarity search is performed and the result collected,
which becomes a set of examples of possible homologous proteins. PSI-BLAST,
the most used homology search program PITR], was employed as the
similarity search algorithm. PSI-BLAST has state-of-the-art accuracy |2S]. PSI-BLAST
is essentially an iterative nearest-neighbour method (in sequence space).
The result of a PSI-BLAST search is a list of possible homologues, sorted by their
e-value m- The lower the e-value, the higher the probability that the match
does not randomly occur in the database, which implies that the matches are
evolutionarily related.

b) Data Accumulation and Data Preparation. For each possible homologous
protein all available data from the database of pre-fabricated facts
is collected. The information stored in the database was selected for relevance
to the detection of homology, originating from a wide variety of bioinformatic
sources. For each protein in the SWISS-PROT database jS] we collected:

— Directly from SWISS-PROT: the description, keywords, organism’s classification,
molecular weight, and database references (Prosite, HSSP, EMBL,
PIR - excluding SCOP classifications).

— The predicted secondary structure - we used the DSC method HU on single
sequences (as a multiple sequence method would require a homology search).

— The amino acid distribution for singlets and pairs of residues, as used by the
PROPSEARCH algorithm [U].

— The predicted cleavage sites from SignalP p5].

— The total hydrophobic moment assuming secondary structure 011].

— The length and starting point of local PSI-BLAST alignments.

Assembling this information in one large table would, in principle, be possible,
but access to the data would be highly complex and inefficient. However, the
assembly of such a table is required as the starting point for statistical, neural
network, or standard machine learning methods. This limitation of standard learning
techniques is known as the “multi-table problem”, i.e. learning from multi-relational
data stored in multiple tables um- We therefore chose to represent this
information in the form of a database of datalog facts.

To give an example of the database, we show how the predicted secondary
structure was translated from a single string into a set of datalog facts. The
secondary structure of a protein possesses three states: α-helix, β-strand, and
coil. If, for example, a protein has the following predicted secondary structure:
ααααccccccαααααcccccccβββ, this would translate into: the 1st α-helix
secondary structure prediction is of length 4; the 1st coil secondary structure
prediction is of length 6; the 2nd α-helix secondary structure prediction is of
length 5; the 2nd coil secondary structure prediction is of length 7; and the 1st
β-strand structure prediction is of length 3.
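
This translation is a run-length encoding in which each run also records how many runs of the same state came before it. A minimal Python sketch follows, using the letters a, c and b as stand-ins for helix, coil and strand. The tuple layout (state, occurrence index, length) and the function name are assumptions; the actual predicate names in the database are not given in the text.

```python
from collections import defaultdict
from itertools import groupby


def secondary_structure_facts(prediction):
    """Turn a per-residue prediction string into (state, index, length)
    tuples, where index counts the runs of that state seen so far."""
    seen = defaultdict(int)
    facts = []
    for state, run in groupby(prediction):
        seen[state] += 1
        facts.append((state, seen[state], len(list(run))))
    return facts


# Helix of 4, coil of 6, helix of 5, coil of 7, strand of 3, as in the example.
prediction = "a" * 4 + "c" * 6 + "a" * 5 + "c" * 7 + "b" * 3
for fact in secondary_structure_facts(prediction):
    print(fact)
```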

The translated SWISS-PROT database (version 39) is 1.21 Gb and contains 
51,430,710 facts for a total of 86,593 proteins. 

c) Division. The result of the initial similarity search is divided into two subsets:
a set of positive examples (proteins which are almost certainly homologous),
and a set of uncertain examples. As a threshold for this division, the algorithm
uses the inclusion e-value used in a). A third set is generated, to supply a set
of negative examples. These negative examples are randomly selected from a list
of all SWISS-PROT 0 proteins which do not occur in the set of positive or
uncertain examples.
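
The division step can be sketched in Python as follows. The function name, the input format (pairs of protein identifier and e-value) and the use of "less than or equal" at the threshold are assumptions of this example; the text only states that the inclusion e-value is the threshold and that negatives are drawn at random from proteins outside the hit list.

```python
import random


def divide(hits, inclusion_evalue, all_proteins, n_negatives, seed=0):
    """Split search hits into positive and uncertain examples and sample
    negative examples from proteins that are not among the hits."""
    positives = [p for p, e in hits if e <= inclusion_evalue]
    uncertain = [p for p, e in hits if e > inclusion_evalue]
    found = {p for p, _ in hits}
    pool = sorted(p for p in all_proteins if p not in found)
    rng = random.Random(seed)          # fixed seed only for reproducibility
    negatives = rng.sample(pool, min(n_negatives, len(pool)))
    return positives, uncertain, negatives


hits = [("P1", 1e-30), ("P2", 1e-5), ("P3", 0.5)]
proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
positives, uncertain, negatives = divide(hits, 1e-3, proteins, n_negatives=2)
print(positives, uncertain, negatives)
```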

d) Induction. The most natural solution to the multi-table problem is to use
inductive logic programming (ILP) [21]. ILP is the form of machine learning that
is based on first-order logic and is particularly suitable for problems where the
data is structured, there is a large amount of background knowledge, and where
the formation of comprehensible rules is desirable. This is a key advantage of
using ILP. We used the ILP system Aleph version 2.75, which is based on
inverse entailment. Aleph (and the related program Progol) has been successfully
applied to a variety of problems in computational biology, such as learning rules
to obtain Structure-Activity Relationships (SARs) [14,15], and protein topology
prediction [35]. Aleph searches for logic programs (rules) which are true for
positive examples and not true for the negative examples. In HI the positive
examples are the sequences known to be homologous by use of the SSS, and
the negative examples are 1000 random sequences that are not homologous. We
generated the set of negative examples from SWISS-PROT proteins not occurring
in the list of possible homologues. As the problem of remote homology detection
is a real world application, one cannot omit the possibility of errors in the data.
To accommodate this possibility, Aleph was set to accept learned rules with up
to 15% noise. Furthermore, to avoid overfitting of the rules, a minimum of ten
positive examples is required before proceeding to the induction step. Aleph
is in general versatile, bringing together the power of first order logic and the
possibility of using background knowledge. However, it is not very suitable for
use directly on numerical values, as Aleph searches the lattice for each single
value for one attribute; the search using numerical values can be inefficient,
depending on the number of distinctively different values. Possible solutions are
to introduce operators, such as less than and greater than, or to use discretisation.
We chose to discretise all numerical values into 10 levels.
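
The text does not say how the 10 levels were chosen. As one possible reading, the Python sketch below uses equal-width binning between the smallest and largest observed value; this choice, like the function name, is an assumption, and equal-frequency binning would be an equally plausible alternative.

```python
def discretise(values, levels=10):
    """Map each numerical value to a level 0..levels-1 using equal-width bins
    over the observed range (the binning scheme is an assumption)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0] * len(values)
    width = (highest - lowest) / levels
    # The maximum falls on the upper edge and is folded into the last level.
    return [min(int((v - lowest) / width), levels - 1) for v in values]


print(discretise([0, 10, 55, 99, 100]))
```

After discretisation each attribute has at most 10 distinct constants, which bounds the number of literals the ILP search has to consider per attribute.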

e) Application. After the rules were learnt they were applied to the uncertain
examples. If a rule was true for an uncertain example, this was considered
evidence, along with the weak sequence similarity, towards identifying the
example as homologous. We therefore used the rules to identify proteins which
have uncertain evidence for homology based only on sequence, but have sufficient
evidence based on sequence and the other information from our annotated
deductive database. Following the induction step, initial results collected from
PSI-BLAST are re-arranged according to the rules found by Aleph. This is done
by modifying the original E-value reported by PSI-BLAST. The results covered
by the rules found are assigned a lower E-value, while proteins not covered
keep the same value as before. This is done by multiplying the original
E-value, received from PSI-BLAST, by a constant evidence factor (EF, with EF
< 1). This approach is based on the assumption that if a protein is covered by a rule
then this gives further evidence of homology. Hence, it should be moved further
up the list of close homologous sequences found by PSI-BLAST. We call the
resulting value the $E_{HI}$-value.
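
The re-ranking step amounts to a single multiplication followed by a sort, as this Python sketch shows. The value 0.1 for the evidence factor is purely illustrative (the text only requires EF < 1), and the function name and input format are assumptions.

```python
def rerank(hits, covered, evidence_factor=0.1):
    """Multiply the E-value of every hit covered by a learnt rule by the
    evidence factor and re-sort the list by the adjusted value."""
    adjusted = [
        (protein, e_value * evidence_factor if protein in covered else e_value)
        for protein, e_value in hits
    ]
    return sorted(adjusted, key=lambda pair: pair[1])


hits = [("A", 1e-4), ("B", 0.5), ("C", 2.0)]
ranked = rerank(hits, covered={"C"})
print(ranked)
```

Here protein C moves ahead of B because a rule covers it, while the uncovered hits keep their original values.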

2.2 Experiment 

To assess the accuracy of homology detection it is necessary to have a “gold
standard” set of known homologies. To test HI we used the systematic approach
of Park et al. [27], which uses a subset of the Structural Classification of
Proteins (SCOP) database [2^. The SCOP database is a classification database of
proteins of known function and structure. The subset used for testing includes
all proteins in the SCOP database with at most 40 per cent sequence
similarity (PDB40D database). The evaluation of an SSS is done by investigating
each entry in the PDB40D database and accumulating the results. In theory,
a perfect homology search algorithm should be able to detect all homologous
relationships in PDB40D. However, in most cases, unrelated, non-homologous
proteins can be found in the so-called twilight zone (errors of commission), while
at the same time evolutionarily related proteins are omitted (errors of omission).

It is clear that for a high cut-off value non-homologous sequences will be
considered to be homologous, while for a low cut-off value homologous sequences
will be considered not to be homologous. Using a fixed cut-off value together
with a simple error rate would therefore be a crude measure for comparison
of homology searches. A further reason for not using error rates as a measure
of comparison is the inability to take into account different costs for different
types of prediction errors. In most inference problems, the cost of predicting an
example to be negative when it is actually positive (an error of omission) is
not equal to the cost of predicting an example to be positive when it is actually
negative (an error of commission). To counter the problem of different costs
and cut-off values, Receiver Operating Characteristic (ROC) curves are used
to compare prediction and classification methods; they were first introduced in
signal detection |,‘Ibl>^l, 4:1141 . The value of a ROC curve is that if one prediction
method produces a curve to the left of another method, then the method to
the left is superior. This conclusion is true regardless of the particular costs
associated with errors of commission and omission (assuming linearity of costs).
We use ROC curves as our main method for comparison between the standard
method PSI-BLAST and HI.
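As an illustration of how a ROC curve is obtained by sweeping the cut-off value, the following sketch computes the curve points and the area under them from a list of E-values and true homology labels. The function names and the trapezoidal integration are choices made for this example; the paper does not specify how its curves were computed.

```python
def roc_points(e_values, labels):
    """Return (false positive rate, true positive rate) points.

    The cut-off is swept from the smallest to the largest E-value;
    everything at or below the cut-off is predicted homologous.
    """
    pairs = sorted(zip(e_values, labels))
    n_pos = sum(1 for _, is_homologous in pairs if is_homologous)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Hits with identical E-values enter the prediction set together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A method that ranks every true homologue ahead of every non-homologue reaches an area of 1, while a ranking that cannot separate the two classes gives 0.5.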

Two Different Test Set-Ups. To illustrate that the HI approach is robust
and can work equally well with data generated from sequence alone, two different
settings were used. The first setting, called HI_all, made use of the full database
described in the previous section. The second setting, called HI_seq, used purely
sequence-based information. All entries from the database originating from
SWISS-PROT, except mol_weight and seq_length (as this information could be
computed from sequence), were removed. This left 19 possible terms to be learnt
from. These two settings correspond to the two most common uses for homology
searching: setting one - searching for a homologous protein with known structure;
setting two - annotating a newly sequenced gene.

3 Results 

3.1 Rules 

We performed the HI algorithm with the two different settings on the SCOP
(PDB40D) database having 1,434 entries. 409 entries had fewer than 10 SWISS-PROT
hits, and no rules could be learnt. In these cases the original PSI-BLAST
results were considered to be the HI output. This left 1,025 examples to be
filtered.



HI_all Setting. For the HI_all setting, Aleph could induce rules for 1,015 PDB40D
examples. For 14 examples no rules could be learnt, and the PSI-BLAST results
were taken as output. For the remaining 1,015 cases we collected the results
and applied the induced rules to the uncertain examples of each PDB40D entry.
For 701 PDB40D entries Aleph induced one single rule, for 133 entries rule sets
consisting of two separate rules, while producing three or more separate rules
for 177. Altogether, HI_all produced 1,851 rules for the 1,015 PDB40D entries,
1,030 using only one predicate. The most commonly used predicate of the single
predicate rules was database references to other databases (db_ref), utilised by
651 rules. These rules consisted mainly of references to Prosite (639) and some
to the HSSP database (12). This can be expected as both databases cluster
homologous families of proteins together.



HI_seq Setting. For the HI_seq setting, Aleph could only induce rules for 949
PDB40D entries, compared to 1,015 for HI_all. As before, the original PSI-BLAST
output was taken in cases where no rules could be learnt. Although this seems
to be relatively similar to the number of rules learnt before, the distribution of
the number of rules in a rule set is very different. Only 371 examples could
be explained with one single rule. A further 228 examples had 2 rules, and 137
examples had 3 rules. For 213 examples, a rule set with more than 3 rules was
induced.

To give an example of how differently the two HI settings induced rules, we
present the two results of the induction for myoglobin with the PDB entry 1MBD.



3.2 ROC Analysis for HI and PSI-BLAST

The first method we investigated to compare PSI-BLAST and HI was based
on the concepts of precision, recall and accuracy from information retrieval [31].
This comparison is more elementary than that of ROC curves. Table 1 shows the
precision and recall for PSI-BLAST, HI_all and HI_seq using a cut-off E-value of 10.
The accuracy measure for all methods is very high, as the PDB40D database has
8,022 true homology relationships and 2,046,900 false ones. This makes the
measure of accuracy inappropriate, as the number of positive relationships is very
small compared to the number of negative ones. Although both HI accuracies are
higher than that of PSI-BLAST, it is not clear at first sight whether they are
significantly higher. To test significance we performed a two-sample χ² test to
compare the actual frequency of a prediction with the estimated frequency of the
prediction. For HI_all the χ² value is 45.35 and for HI_seq it is 47.85. Comparing
these values with the critical values from a significance table m indicates that both
methods are independent from each other. The critical value of χ² for 1 degree of freedom
PDB 1MBD Myoglobin (Deoxy, pH 8.4)
SCOP 1.1.1.1.4 HI_all:

Rules found: 1

1 homologous(A) :- db_ref(A, prosite, ps01033, B, C).

Fig. 1. The rule induced for the myoglobin 1MBD considers all proteins to be
homologous which possess a database reference connecting them to the Prosite sequence
pattern database entry PS01033. The PS01033 pattern is the GLOBIN pattern. Globins
are heme-containing proteins involved in binding and/or transporting oxygen and are
a well studied group of proteins. Myoglobins are a type of globin responsible for storing
oxygen in vertebrates. The only positive example not included in this rule is a
leghemoglobin, fitting the Prosite pattern PS00208 for plant globins. Two out of four possible
uncertain examples are covered by this rule; both are hemoglobins. Hemoglobins
belong to the same family as the query sequence and are homologous to myoglobins.
The two uncertain examples not covered by this rule are a protein associated with
microtubules and a serine hydroxymethyltransferase protein; neither is homologous
to myoglobins. In summary, the HI rule uses the Prosite information to improve on the
standard PSI-BLAST method.

Table 1. Precision, recall and accuracy for all three methods.

Method      Precision   Recall   Accuracy in per cent
PSI-BLAST   0.34        0.717    99.68991
HI_all      0.32        0.787    99.70072
HI_seq      0.30        0.789    99.69449

and 99.5% confidence is 7.879, which indicates that HI_all and HI_seq are both
significantly better than PSI-BLAST.
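The remark that accuracy is uninformative for such skewed data can be made concrete with a small sketch. The class sizes are the ones quoted in the text (8,022 true and 2,046,900 false relationships); the degenerate "predict nothing" method and the function name are hypothetical and serve only as an illustration.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Standard information-retrieval measures from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Class sizes of PDB40D as reported in the text.
N_POSITIVE = 8022
N_NEGATIVE = 2046900

# Hypothetical method that never predicts homology.
trivial = precision_recall_accuracy(tp=0, fp=0, fn=N_POSITIVE, tn=N_NEGATIVE)
```

Even this useless method has an accuracy above 99.6 per cent while its recall is zero, which is why precision, recall and ROC analysis are the more meaningful comparisons here.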

This test is based on one cut-off value (i.e. one set of costs). To test all linear
costs we performed a ROC analysis. Both HI set-ups (HI_all and HI_seq) were
compared with PSI-BLAST.

With HI the results for the uncertain examples are re-organised according to
their E-values. This is done by multiplying the original E-value, received by
PSI-BLAST, with a factor. To optimise this factor, the area under the ROC curve
(AUROC) [4,29,30] was calculated for possible factor settings to maximise the area.
This approach was taken instead of a full cross-validation, as cross-validation
would be computationally prohibitive and the large size of the database makes
the estimate robust.

A variety of different factors were used, starting with 9 x 10^-1 and proceeding
to successively smaller orders of magnitude. The initial step for changing a factor
is 0.1, resulting in the AUROCs calculated for the factors 0.9, 0.8, 0.7, ..., 0.1.
Then the factors were changed by an order of magnitude to 0.09, 0.08, 0.07, ..., 0.01.
Figure 3 shows the different results from this analysis. For HI_all the AUROC
peaked at 2 x 10“^, while for HI_seq it peaked at only 8 x 10“^. The maximum
AUROC value for HI_all is 0.7571, while the maximum AUROC for HI_seq is 0.7515.
The AUROC for PSI-BLAST is 0.7508.
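The search for the best re-sorting factor can be sketched as a simple grid search. This is an assumption-laden illustration: the function names are hypothetical, the grid covers only the two orders of magnitude spelled out in the text, and the AUROC is computed in its rank-based form (the probability that a random homologue has a smaller E-value than a random non-homologue, ties counting one half).

```python
def auroc(e_values, labels):
    """Rank-based AUROC; smaller E-values mean stronger predictions."""
    pos = [e for e, y in zip(e_values, labels) if y]
    neg = [e for e, y in zip(e_values, labels) if not y]
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def factor_grid():
    """Factors 0.9, 0.8, ..., 0.1 followed by 0.09, 0.08, ..., 0.01."""
    return [k * 10.0 ** -d for d in (1, 2) for k in range(9, 0, -1)]


def best_factor(e_values, covered, labels, factors):
    """Return (factor, AUROC) for the factor with the largest AUROC."""
    best = None
    for factor in factors:
        rescored = [e * factor if c else e
                    for e, c in zip(e_values, covered)]
        score = auroc(rescored, labels)
        if best is None or score > best[1]:
            best = (factor, score)
    return best
```

Because the factor is tuned on the same data that is used for evaluation, the resulting AUROC is an optimistic estimate; the text argues that the size of the database keeps this effect small.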
PDB 1MBD Myoglobin (Deoxy, pH 8.4)
SCOP 1.1.1.1.4 HI_seq:

Rules found: 7

1 homologous(A) :- amino_acid_ratio_rule(A,a,10),
                   amino_acid_pair_ratio_rule(A,l,s,10),
                   amino_acid_pair_ratio_rule(A,a,l,10).
2 homologous(A) :- mol_wt_rule(A,6),
                   sec_struc_coil_rule(A,5,8),
                   sec_struc_coil_rule(A,1,6).
3 homologous(A) :- sec_struc_rule(A,a,4,9),
                   sec_struc_rule(A,a,4,10).
4 homologous(A) :- sec_struc_alpha_rule(A,7,10),
                   sec_struc_alpha_rule(A,6,1).
5 homologous(A) :- sec_struc_distribution_rule(A,c,5),
                   sec_struc_distribution_rule(A,b,1),
                   sec_struc_conf_rule(A,10).
6 homologous(A) :- domain_rule(A,4,9,4,4).
7 homologous(A) :- signalipl_rule(A,9),
                   sec_struc_beta_rule(A,4,6).

Fig. 2. This complex set of rules consists of seven separate rules. The first rule
translates as follows: Consider every example to be homologous if its amino acid
sequence has a high Alanine content, very high frequencies of Leucine-Serine, and very
high frequencies of Alanine-Leucine. The second rule translates similarly to: Consider
every example to be homologous if it has a molecular weight slightly above the median,
a relatively long predicted secondary coil structure in the middle of the sequence, and a
medium-length predicted secondary coil structure at the beginning of the sequence.
Although HI produced a large number of rules, none of the four uncertain examples
is covered by the rule set.



The resulting ROC curves using the optimal re-sorting factor can be seen in
Figure 4 together with the ROC curve for PSI-BLAST. The dominating curve is
the curve produced by HI_all, being to the left of the other two curves. Although
the ROC curve of HI_seq produced a higher AUROC than the curve produced
by PSI-BLAST, it does not entirely dominate the PSI-BLAST ROC curve. For
large sections of the false positive axis, the two curves have a similar true positive
rate. Only in the false positive rate interval of 0.38 to 0.5 does the ROC curve
produced by HI_seq dominate the one produced by PSI-BLAST.

4 Web Server 

The HI method is available for use as a web/mail server
(http://www.aber.ac.uk/~phiwww/hi_V2/index.html). To the best of our knowledge
this is the first internet server providing an ILP service. The server is a simple HTML form



[Figure: Comparison of the two HI methods' AUROC values using different sorting factors; x-axis: sorting factors]

Fig. 3. This figure shows the calculated areas under the ROC curve for both HI methods
(HI_all and HI_seq) for a variety of re-sorting factors. The two curves follow a very
different pattern. The AUROC values for HI_all increase steadily and reach their maximum
value at 6 X 10“® with a value of 0.651169, while the AUROC values for HI_seq first
increase and then decrease again, with a peak at 8 x 10“^ with an AUROC value of
0.613467. Comparing both methods with PSI-BLAST shows that HI_seq offers only
a slight improvement over PSI-BLAST, which has an AUROC value of 0.606764. In contrast,
HI_all increases the AUROC value by approximately 7.4 per cent.



supplying the desired information to a CGI-Perl script. The user has the
opportunity to select the parameters of the initial PSI-BLAST search, such as the inclusion
E-value, the maximum E-value to be reported, the number of PSI-BLAST iterations, and
whether a low complexity sequence filter should be used. The user is also offered the
possibility to select a different E-value to divide between positive examples and
examples in the twilight zone. In the induction step, it is possible to select which
Datalog facts are to be used in the induction, as well as Aleph-specific options, such as
the minimum number of positive examples required and the percentage of noise
allowed.

All this information, including the amino acid sequence in question, is passed
to a CGI-Perl script checking for inconsistencies. If the information is regarded
as consistent, the script sends an email to a daemon which distributes
the job to a specified computer. At the moment it uses just one machine;
however, it is possible to have multiple parallel jobs running on different
machines.

[Figure: Maximum ROC curves for PSI-BLAST and the two HI methods; x-axis: false positive rate (1-specificity)]

Fig. 4. The three ROC curves produced by PSI-BLAST, HI_all, and HI_seq for
predictions in the twilight zone. While the ROC curve for PSI-BLAST results from applying
ROC analysis directly to the results produced, the ROC curves for both HI methods
are maximised using an optimal value for re-sorting. The ROC curve for HI_all
dominates the other two curves at all times, while the curves for PSI-BLAST and
HI_seq oscillate around each other. HI_seq dominates the PSI-BLAST curve between
~0.38 and ~0.5.

5 Discussion and Conclusion 

HI is a first step in the application of ILP to aid in the inference of homology 
by exploiting bioinformatic data other than the basic sequence. We have shown 
that HI is more sensitive than the state-of-the-art sequence method PSI-BLAST, 
and that HI performs better for all error costs. Although this result only shows 
that HI is an improvement over PSI-BLAST, the basic approach of HI is ap- 
plicable to all sequence-based homology search methods. We therefore expect a 
similar level of improvement over other methods such as Hidden Markov Models.
Many improvements to HI are possible. Other sources of bioinformatic data
and more biological background knowledge could be used. For example, comment
lines from SWISS-PROT could be included (although this would require
a more refined computational linguistic analysis), and database links to Medline
abstracts could be exploited. Much of the data in the deductive database is
still propositional in form, and this does not fully exploit the power of ILP. More
background knowledge could be used to allow ILP to use < and >, numerical
neighbourhoods, hierarchies of keywords, phylogenetic trees, etc. Cross-validation
could be used to get better estimates of the accuracy of rules. Data mining
algorithms such as WARMR could be used to pre-process the data to find
frequent patterns which would make learning easier and more successful. Multiple
theories could be learnt and combined, e.g. using boosting and bagging [TOlSl .
Also, different algorithms could be used and their predictions combined
[34]. We expect that these improvements would greatly improve the sensitivity
of homology detection over the level achieved by HI. There is a need for new
approaches to inferring homology. One of the most interesting results of Park et
al. [27] was how relatively uncorrelated the errors were from the three different
homology prediction methods examined. This means that better results could be 
obtained by combining prediction methods when inferring homology. This can 
be seen as another example of the statistical principle stated in the introduction: 
all available relevant information should be used. We believe that HI should be 
seen in this light. It is far from being the last word in inferring homology, but it 
is a valuable new approach. 
Abstract. First order probabilistic logics combine a first order logic
with a probabilistic knowledge representation. In this context, we
introduce continuous Bayesian logic programs, which extend the recently
introduced Bayesian logic programs to deal with continuous random
variables. Bayesian logic programs tightly integrate definite logic programs
with Bayesian networks. The resulting framework nicely separates the
qualitative (i.e. logical) component from the quantitative (i.e. the
probabilistic) one. We also show how the quantitative component can be
learned using a gradient-based maximum likelihood method.



1 Introduction 



In recent years, there has been an increasing interest in integrating probability
theory with first order logic, leading to different types of "first order probabilistic
logics". One of the streams [22,20,11,15] concentrates on first order extensions
of Bayesian networks [21], i.e. it aims at integrating two powerful and popular
knowledge representation frameworks: Bayesian networks and first order logic.
When investigating the state of the art in this stream (cf. [20,11,15,12]),
there are two important shortcomings of the mentioned techniques. They either
do not allow modelling of continuous^1 random variables or do not use (logical)
languages that allow for functor symbols. Nevertheless, both of these features are
highly desirable for true "first order probabilistic logics". Indeed, almost every
real-world domain, including biology, medicine and finance, involves continuous
variables. Domains involving a potentially infinite number of random
variables also occur quite naturally in practice (e.g. temporal processes), which
requires the use of functors to model the domain.

The first contribution of this paper is the introduction of continuous Bayesian
logic programs, which allow modelling of infinite domains using functors as well as
continuous random variables. The semantics of these Bayesian logic programs is
given in the context of discrete-time stochastic processes. Because, as we have
argued in [12], (discrete) Bayesian logic programs^2 can serve as a kind of common
kernel of first order extensions of Bayesian networks such as probabilistic
logic programs [20], relational Bayesian networks [11] and probabilistic relational
models [15], continuous Bayesian logic programs are novel. They generalize
dynamic Bayesian networks, Kalman filters, hidden Markov models, etc.

^1 In this paper we understand a continuous variable as a variable having $\mathbb{R}$ or a
compact interval in $\mathbb{R}$ as domain. A discrete random variable has a countable domain.
^2 Discrete Bayesian logic programs are Bayesian logic programs allowing only for
discrete random variables, see [12].

The second contribution of this paper addresses the famous question: "where
do the numbers, the parameters of the quantitative aspects, come from?" So far,
this issue has not yet attracted much attention in the context of first order
extensions of Bayesian networks (with the exception of [16,7]). In this context, we
present for the first time how to calculate the gradient for a maximum likelihood
estimation of the parameters of Bayesian logic programs. This gives one a rich
class of optimization techniques such as conjugate gradient and the possibility
to speed up techniques based on the EM algorithm, see m- 

We proceed as follows. After motivating continuous Bayesian logic programs
with a simplified example from quantitative genetics in Section 2, we introduce
continuous Bayesian logic programs in Section 3. In Section 4 we formulate
the likelihood of the parameters of a Bayesian logic program given some data
and, based on this, we present in Section 5 a gradient-based method to find
those parameters which maximize the likelihood. After discussing related work in
Section 7 and reporting experimental experiences in Section 8 we conclude. We
assume some familiarity with logic programming (see e.g. [H]) as well as with 
Bayesian networks m- 

2 Quantitative Genetics 

A natural domain where Bayesian logic programs should help is genetics. Here,
the family relationship forms the basis for the dependencies between the
random variables and the biological laws provide the probability distribution. Even
if the genotype may best be modeled using discrete random variables, some
phenotypes such as the height of a person are naturally represented using continuous
variables. Moreover, in many situations phenotypes can be influenced by
environmental (continuous) quantities such as the amount of nuclear radiation. The
subfield of genetics which deals with continuous phenotypes is called quantitative
genetics (cf. [6]).

Fig. 1. The dependencies of the genetic domain. The height $h(X)$ of a person $X$
depends on the heights $h(M)$, $h(F)$ of its mother $M$ and father $F$.

As an example consider a simplified model of the inheritance of the heights
of persons. The height $h(X)$ of a specific person $X$, interpreted as a continuous
random variable (having $\mathrm{dom}(h) = \mathbb{R}$), depends on its genotype $g(X)$, a
discrete random variable. The genotype $g(X)$ itself depends on the genotypes of the
mother $g(M)$ and father $g(F)$. Furthermore, we could assume that $h(X)$ is
influenced by the heights of its mother $h(M)$ and father $h(F)$. Figure 1 shows a graph
modelling the described dependencies. The graph can be seen as the dependency
structure of a Bayesian network [21]. Thus, we are interested in representing the
joint probability density function (jpdf) $p(g(X), g(M), g(F), h(X), h(M), h(F))$.
Let $p$ denote a probability density and $P$ a probability distribution. The chain
rule of probability states $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{i-1}, \ldots, x_1)$ for a set
$\{x_1, \ldots, x_n\}$ of random variables. The known biological dependencies express
conditional independency statements such as: $g(X)$ is conditionally independent
from $h(X)$ given a joint state of $g(M), g(F)$, i.e. $p(g(X), h(X) \mid g(M), g(F)) =
p(g(X) \mid g(M), g(F))$. From the chain rule and the biological laws it follows that

$$p(g(X), g(M), g(F), h(X), h(M), h(F)) = p(g(X) \mid g(M), g(F)) \cdot p(g(M)) \cdot p(g(F)) \cdot p(h(X) \mid h(M), h(F)) \cdot p(h(M)) \cdot p(h(F)). \tag{1}$$

Thus, we must define an infinite set of densities for $h(X)$, one for every possible
joint value $\mathbf{u}$ that its parents $\mathrm{Pa}(h(X)) = \{h(M), h(F)\}$ (the direct predecessors
of a variable in the dependency graph) can take. Hence, for each $u \in \mathrm{dom}(h(X))$
we have a function $cpd(h(X) \mid \mathrm{Pa}(h(X)))(u \mid \mathbf{u})$ that denotes the conditional
probability density $p(h(X) = u \mid \mathrm{Pa}(h(X)) = \mathbf{u})$. We will call such a function
a probability density function (pdf). Note that the upper-case letters used for X, F
and M do not indicate that they are variables in a logical sense. The representation
so far is inherently propositional, i.e. the regularities cannot be represented
intensionally. We have to describe it for each "family" X, M, F. The framework
of continuous Bayesian logic programs aims at intensionally representing such
regularities.
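To make the notion of a conditional density function concrete, the following sketch implements one possible $cpd(h(X) \mid \mathrm{Pa}(h(X)))$ as a linear-Gaussian density. The functional form, the weights and the standard deviation are assumptions chosen for illustration only; the text does not commit to a particular family of densities at this point.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def height_cpd(u, parent_heights, weights=(0.0, 0.5, 0.5), std=5.0):
    """p(h(X) = u | h(M), h(F)) under a hypothetical linear-Gaussian model.

    parent_heights is the joint parent value (h(M), h(F)); the mean of the
    child's height is a weighted sum of the parents' heights plus a bias.
    """
    bias, w_mother, w_father = weights
    h_mother, h_father = parent_heights
    mean = bias + w_mother * h_mother + w_father * h_father
    return gaussian_pdf(u, mean, std)
```

A single function of this kind covers the infinitely many joint parent values at once, which is what makes continuous variables tractable to represent.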

3 (Continuous) Bayesian Logic Programs 

A Bayesian logic program $B$ consists of two components: firstly a logical one, a
set of Bayesian clauses (cf. below) that encodes the assertions of conditional
independence in Equation (1), and secondly a quantitative one, a set of conditional
probability functions and combining rules (cf. below) corresponding to that
logical structure. In particular, a Bayesian (definite) clause $c$ is an expression of
the form

$$A \mid A_1, \ldots, A_n \tag{2}$$

where $n \geq 0$, the $A, A_1, \ldots, A_n$ are Bayesian atoms and all Bayesian atoms
are (implicitly) universally quantified. We define $head(c) = A$ and $body(c) =
\{A_1, \ldots, A_n\}$. The differences between a Bayesian and a logical clause are:
(1) the atoms $p(t_1, \ldots, t_m)$ and predicates arising are Bayesian, i.e. they have
an associated domain $\mathrm{dom}(p)$, and (2) we use "|" instead of ":-".
Furthermore, most other logical notions carry over to Bayesian logic programs. So,
m(uta, john).
f(peter, john).
g(uta).
g(peter).
g(john) | m(uta, john), g(uta), f(peter, john), g(peter).
h(uta) | g(uta).
h(peter) | g(peter).
h(john) | g(john), m(uta, john), h(uta), f(peter, john), h(peter).

Fig. 2. A Bayesian logic program which essentially encodes the Bayesian network in
Figure 1. Here, Uta and Peter are the parents of John. The Bayesian logic program is
the grounded version of the Bayesian logic program in Figure 3.



we will speak of Bayesian predicates, terms, constants, functors, substitutions,
ground Bayesian clauses, etc. For instance, consider the Bayesian clause $c_1$:
h(X) | m(X,M), h(M), where $\mathrm{dom}(m) = \{true, false\}$ and $\mathrm{dom}(h) = \mathbb{R}$. It says that
the height of a person X depends on the height of its mother M. Intuitively, a
Bayesian predicate generically represents a set of random variables. More
precisely, each Bayesian ground atom $p(t_1, \ldots, t_m)$ corresponds to a random variable
with $\mathrm{dom}(p(t_1, \ldots, t_m)) := \mathrm{dom}(p)$. As long as no ambiguity occurs, we do not
distinguish between a Bayesian predicate (atom) and its corresponding logical
predicate (atom).

In order to represent a probabilistic model we associate to each Bayesian
clause $c$ a probability density function $cpd(c)$ encoding $p(head(c) \mid body(c))$.
It generically represents the conditional probability densities of all ground
instances $c\theta$ of the clause $c$. In general, one may have many clauses, e.g. $c_1$ and the
clause $c_2$: h(X) | f(X,F), h(F), and corresponding substitutions $\theta_i$ that ground
the clauses $c_i$ such that $head(c_1\theta_1) = head(c_2\theta_2)$. They specify $cpd(c_1\theta_1)$ and
$cpd(c_2\theta_2)$, but one needs $p(head(c_1\theta_1) \mid body(c_1) \cup body(c_2))$. The standard
solution to obtain the densities required are so-called combining rules (see e.g. [20]),
functions which map finite sets $\{p(A \mid A_{i1}, \ldots, A_{in_i}) \mid i = 1, \ldots, m\}$ of
conditional probability functions onto one combined conditional probability function
$p(A \mid B_1, \ldots, B_k)$ with $\{B_1, \ldots, B_k\} \subseteq \bigcup_{i=1}^{m} \{A_{i1}, \ldots, A_{in_i}\}$. We assume that for
each Bayesian predicate $p$ there is a corresponding combining rule $cr(p)$, such as
noisy_or in the case of discrete random variables or a linear regression model in
the case of Gaussian variables.
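As a concrete instance of a combining rule, the sketch below implements the common noisy-or formulation for a Boolean head atom. It assumes that each ground clause with the same head contributes the probability that the head is true given that clause's body alone; the function name and this interface are illustrative and not taken from the paper.

```python
def noisy_or(clause_probabilities):
    """Combine per-clause probabilities P(head = true | body_i).

    Under noisy-or the head is false only if every clause
    independently fails to make it true.
    """
    prob_all_fail = 1.0
    for p in clause_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        prob_all_fail *= 1.0 - p
    return 1.0 - prob_all_fail
```

The rule maps any finite number of clause-level probabilities onto one combined probability, which is the property a combining rule needs.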

To summarize, a Bayesian logic program $B$ consists of a (finite) set of
Bayesian clauses. To each Bayesian clause $c$ there is an associated conditional
probability function $cpd(c)$, and for each Bayesian predicate $p$ there is exactly
one associated combining rule $cr(p)$.

The declarative semantics of Bayesian logic programs is given by the
annotated dependency graph. The dependency graph $DG(B)$ is that directed graph
whose nodes correspond to the ground atoms in the least Herbrand model $LH(B)$
(cf. below). It encodes the directly influenced by relation over the random
variables in $LH(B)$: there is an edge from a node $x$ to a node $y$ if and only if there
exists a clause $c \in B$ and a substitution $\theta$, s.t. $y = head(c\theta)$, $x \in body(c\theta)$ and
for all atoms $z$ appearing in $c\theta$: $z \in LH(B)$. The least Herbrand model $LH(B)$
consists of all proper random variables. It is defined as if $B$ were a logical
definite program (cf. E]). It is the least fixpoint of the immediate consequence
operator^3 (cf. [17]) applied on the empty set. Now, to each node $x$ in $DG(B)$
the combined pdf is associated which is the result of the combining rule $cr(p)$
of the corresponding Bayesian predicate $p$ applied on the set of $cpd(c\theta)$'s where
$head(c\theta) = x$ and $body(c\theta) \subseteq LH(B)$. Thus, the dependency graph encodes,
similar to Bayesian networks, the following independency assumption:

    each node $x$ is independent of its non-descendants given a joint state of
    its parents $\mathrm{Pa}(x)$ in the dependency graph.

E.g. the program in Figure 3 renders $h(john)$ independent from $g(uta)$ given
a joint state of $g(john)$, $h(uta)$, $h(peter)$, $m(uta, john)$, $f(peter, john)$. Using this
assumption the following proposition holds:

Proposition 1. Let $B$ be a Bayesian logic program. If $B$ fulfills (1) that
$LH(B) \neq \emptyset$, (2) that $DG(B)$ is acyclic in the usual graph theoretical sense, and
(3) that each node in $DG(B)$ is influenced by a finite set of random variables,
then it specifies a unique probability density over $LH(B)$.

Proof sketch (for a detailed proof see [13]). The least Herbrand model $LH(B)$ always
exists, is unique and countable. Thus, $DG(B)$ uniquely exists, and due to
condition (3) the combined pdf for each node of $DG(B)$ is computable. Furthermore,
because of condition (1) a total order $\pi$ of $DG(B)$ exists, so that one can see $B$
together with $\pi$ as a stochastic process over $LH(B)$. An induction "along" $\pi$
together with condition (2) shows that the family of finite-dimensional
distributions of the process is projective (cf. [1]), i.e. the jpdf over each finite subset
$s \subseteq LH(B)$ is uniquely defined and $\int p(s, x = y)\,dy = p(s)$. With that, the
preconditions of Kolmogorov's theorem [2, page 307] hold, and it follows that $B$
given $\pi$ specifies a probability density function $p$ over $LH(B)$. This proves the
proposition because the total order $\pi$ used for the induction does not refer to
any specific total order of $DG(B)$.
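For a finite ground program, the least Herbrand model, the dependency graph and the acyclicity condition (2) can be computed directly. The sketch below does this for the ground program of Figure 2; the representation of clauses as (head, body) pairs of strings and all function names are assumptions made for the illustration, and non-ground programs would additionally require grounding.

```python
# Ground clauses of Figure 2 as (head, body) pairs.
GROUND_PROGRAM = [
    ("m(uta,john)", []),
    ("f(peter,john)", []),
    ("g(uta)", []),
    ("g(peter)", []),
    ("g(john)", ["m(uta,john)", "g(uta)", "f(peter,john)", "g(peter)"]),
    ("h(uta)", ["g(uta)"]),
    ("h(peter)", ["g(peter)"]),
    ("h(john)", ["g(john)", "m(uta,john)", "h(uta)",
                 "f(peter,john)", "h(peter)"]),
]


def least_herbrand_model(clauses):
    """Iterate the immediate consequence operator from the empty set."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in model and all(atom in model for atom in body):
                model.add(head)
                changed = True
    return model


def dependency_graph(clauses):
    """Map each atom of the least Herbrand model to its set of parents."""
    model = least_herbrand_model(clauses)
    parents = {atom: set() for atom in model}
    for head, body in clauses:
        if head in model and all(atom in model for atom in body):
            parents[head].update(body)
    return parents


def is_acyclic(parents):
    """Repeatedly remove parentless nodes; a cycle leaves nodes behind."""
    remaining = {node: set(ps) for node, ps in parents.items()}
    while remaining:
        roots = [node for node, ps in remaining.items() if not ps]
        if not roots:
            return False
        for root in roots:
            del remaining[root]
        for ps in remaining.values():
            ps.difference_update(roots)
    return True
```

For this program the model is non-empty, the graph is acyclic and every node has finitely many influencing variables, so all three conditions of Proposition 1 hold.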

A program fulfilling conditions 1, 2 and 3 is called well-defined and we will
consider such programs for the rest of the paper. One can think of Bayesian
networks as a simple example of well-defined programs. Their graphically
represented dependencies are encoded as a finite propositional Bayesian logic program
as shown in Figure 2. A program encoding the intensional regularities in our
genetic domain is given in Figure 3. Some interesting properties follow from the
proof sketch.

^3 We assume that all clauses in a Bayesian logic program are range-restricted: all
variables appearing in the conclusion part of a clause also appear in the condition
part. This is a common restriction in computational logic, because then all facts
entailed by the program are ground (cf. [HI).
m(uta, john).
f(peter, john).
g(uta).
g(peter).
g(X) | m(M,X), g(M), f(F,X), g(F).
h(uta) | g(uta).
h(peter) | g(peter).
h(X) | g(X), m(M,X), h(M), f(F,X), h(F).

Fig. 3. A Bayesian logic program encoding the example in our genetic domain.



- We interpreted a Bayesian logic program as a stochastic process. This places
  Bayesian logic programs in the wider context of what Cowell et al. call highly
  structured stochastic systems (HSSS, cf. [3]), because Bayesian logic programs
  represent discrete-time stochastic processes in a more flexible manner. Well-known
  probabilistic frameworks such as dynamic Bayesian networks, first order hidden
  Markov models or Kalman filters are special cases of them.

- Together with the unique semantics for pure discrete programs m it is clear
  that hybrid programs, i.e. programs over discrete and continuous variables,
  have a unique semantics.

Moreover, the proof in [13] indicates the important support network concept.
Support networks are a graphical representation of the finite-dimensional
distribution (cf. [1]) and are needed for the formulation of the likelihood function (see
below) as well as for answering probabilistic queries in Bayesian logic programs.
The support network $N$ of a variable $x \in LH(B)$ is defined as the induced subnetwork
of $S = \{x\} \cup \{y \mid y \in LH(B) \text{ and } y \text{ is influencing } x\}$. The support network
of a finite set $\{x_1, \ldots, x_k\} \subseteq LH(B)$ is the union of the networks of each single
$x_i$. Because we consider well-defined Bayesian logic programs, each $x \in LH(B)$
is influenced by a finite subset of $LH(B)$. So, it is provable that the support
network of a finite set $\{x_1, \ldots, x_k\} \subseteq LH(B)$ of random variables is always
a finite Bayesian network and computable in finite time. Because the support
network $N$ models the finite-dimensional distribution specified by $S$, any
interesting probabilistic density value over subsets of $S$ is specified by $N$. For the
proofs and an effective inference procedure (together with an implementation
using Prolog) we refer to [13].
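For a finite dependency graph given as a parent map, the support network of a set of query variables is simply the subgraph induced by the queries and everything that influences them. The following sketch computes it by a graph traversal; the `PARENTS` map encodes the ground program of Figure 2, and the data layout and function name are assumptions made for this illustration rather than the inference procedure of [13].

```python
# Parent sets of the ground program in Figure 2 (child -> parents).
PARENTS = {
    "m(uta,john)": set(),
    "f(peter,john)": set(),
    "g(uta)": set(),
    "g(peter)": set(),
    "g(john)": {"m(uta,john)", "g(uta)", "f(peter,john)", "g(peter)"},
    "h(uta)": {"g(uta)"},
    "h(peter)": {"g(peter)"},
    "h(john)": {"g(john)", "m(uta,john)", "h(uta)",
                "f(peter,john)", "h(peter)"},
}


def support_network(parents, query_nodes):
    """Return the subgraph induced by the queries and all their ancestors."""
    support = set()
    stack = list(query_nodes)
    while stack:
        node = stack.pop()
        if node in support:
            continue
        support.add(node)
        stack.extend(parents.get(node, ()))
    # Every parent of a supported node is itself supported,
    # so the induced subgraph keeps all parent sets intact.
    return {node: set(parents.get(node, ())) for node in support}
```

Querying only h(uta) yields a two-node network, so irrelevant parts of a possibly infinite dependency graph never have to be constructed.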

4 Maximum Likelihood Estimation 

So far, we have assumed that there is an expert who designs a Bayesian logic
program. This is not always the case. Often, there is no one possessing the
necessary expertise or knowledge. However, we often have access to data. We focus
here on the classical maximum likelihood estimation (MLE) method to learn the
parameters of the associated probability density functions of a given Bayesian
logic program.
Let $B$ be a Bayesian logic program consisting of the Bayesian clauses
$c_1, \ldots, c_n$, and let $D = \{D_1, \ldots, D_m\}$ be a set of data cases.
A data case $D_l \in D$ is a partially observed joint state of some variables
in $LH(B)$. Examples of data cases are

    {m(peter, uta) = true, f(peter, john) = true, h(uta) = 165.98, h(peter) = 175.8},
    {g(peter) = ?, h(uta) = 165.98, h(peter) = 174.4, h(john) = 170},
    {h(uta) = 167.9, h(john) = ?},

where '?' stands for an unobserved state. The parameters
$\lambda(c_i) = \{\lambda(c_i)_1, \ldots, \lambda(c_i)_{e_i}\}$, $e_i > 0$,
affecting the associated pdfs $cpd(c_i)$ constitute the set
$\lambda = \bigcup_{i=1}^{n} \lambda(c_i)$, and the version of $B$ where the
parameters are set to $\lambda$ is denoted by $B(\lambda)$.³ Now, the likelihood
$L(D, \lambda)$ is the probability of the observed data $D$ as a function of
the unknown parameters $\lambda$:

$$L(D, \lambda) := P_B(D \mid \lambda) = P_{B(\lambda)}(D). \qquad (3)$$

Thus, the search space $\mathcal{H}$ is spanned by the product space over the
possible values of $\lambda(c_i)$ and we seek to find
$\lambda^* = \arg\max_{\lambda \in \mathcal{H}} P_{B(\lambda)}(D)$. Usually, $B$
specifies a density function over a (countably) infinite set of random variables
and hence we cannot compute $P_{B(\lambda)}(D)$ by considering the whole
dependency graph. But as we have argued at the end of the preceding section, it
is sufficient to consider the support network $N(\lambda)$ of the random
variables occurring in $D$ to compute $P_{B(\lambda)}(D)$. Thus, remembering
that the logarithm is monotone, we seek to find

$$\lambda^* = \arg\max_{\lambda \in \mathcal{H}} \log P_{N(\lambda)}(D). \qquad (4)$$

In other words, we have expressed the original problem in terms of the parameter
MLE problem of Bayesian networks. However, we need to be more careful. Some
of the nodes in $N(\lambda)$ are hidden, that is, their values are not observed
in $D$.⁴ Furthermore, it should be noted that not only $L(D, \lambda)$ but also
$N(\lambda)$ itself depends on the data, i.e. the data cases determine the
sufficient subnetwork of $DG(B)$ to calculate the likelihood. On the one hand,
this may affect the generalization of the learned program, but on the other
hand, this is a similar situation as for “unrolling” dynamic Bayesian networks
[?] or recurrent neural networks [?].
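For contrast with the hidden-node case, when every node is observed in every data case, ML estimation of a discrete conditional distribution reduces to counting relative frequencies. The sketch below shows only that special case; the function `mle_cpt`, the variable names and the data are hypothetical.

```python
from collections import Counter

def mle_cpt(cases, child, parent_names):
    """Estimate P(child | parents) by relative frequencies.
    Assumes every case observes the child and all of its parents."""
    joint = Counter()
    parent_counts = Counter()
    for case in cases:
        parent_state = tuple(case[p] for p in parent_names)
        joint[(parent_state, case[child])] += 1
        parent_counts[parent_state] += 1
    return {key: n / parent_counts[key[0]] for key, n in joint.items()}

# Hypothetical, fully observed discrete data cases (not taken from the paper).
cases = [
    {"p": "true", "c": "high"},
    {"p": "true", "c": "high"},
    {"p": "true", "c": "low"},
    {"p": "false", "c": "low"},
]
cpt = mle_cpt(cases, "c", ["p"])
```

With unobserved values ('?' in the data cases) these counts are no longer available, which is why the gradient of the likelihood is needed.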

Finally, our learning setting can be used for MLE of the parameters of
intensional rules only. Assume that if we observe h(john) then h(john) = 172.06
holds. In this case, it is problematic to estimate the ML parameters of h(john).
But we can still estimate the ML parameters of h(X) | g(X) based on the
support network of the data cases: the intensional rules together with the data
cases induce a Bayesian network over the variables of the data cases. This is not
surprising if one sees our learning setting as a probabilistic extension of the
ILP setting learning from interpretations. For a discussion on how this analogy
can be used for learning intensional Bayesian clauses (not only the parameters
of the associated densities) we refer to [?].

³ As long as no ambiguities occur we will not distinguish between the parameters
$\lambda$ themselves and a particular instance of them.

⁴ If all nodes are observed in each $D_l \in D$, then simple counting is all that
is needed for ML parameter estimation in Bayesian networks (see e.g. [?]).

Fig. 4. The scheme of decomposable combining rules. Each rectangle corresponds to a
ground instance of a Bayesian clause (cp. definition of a combining rule). The node A
is a deterministic node.

5 The Gradient 

How can we maximize the likelihood? A classical method for finding a maximum
of an evaluation function is gradient ascent, also known as hill climbing.
Here, one computes the gradient vector $\nabla_\lambda$ of partial derivatives
with respect to the parameters of the pdfs at a given point
$\lambda \in \mathcal{H}$. Then one takes a small step in the direction of the
gradient to the point $\lambda + \alpha \nabla_\lambda$, where $\alpha$ is the
step-size parameter. The algorithm will converge to a local maximum for small
enough $\alpha$. Thus, we have to compute the partial derivatives of
$P_{N(\lambda)}(D)$ with respect to some particular parameter $\lambda(c)_t$.⁵
For the sake of simplicity we will assume decomposable combining rules.⁶ Such
rules can be expressed using a set of separate, deterministic nodes in the
support network, as shown in Figure 4. Most combining rules commonly employed in
Bayesian networks, such as noisy_or or linear regression, are decomposable
(cp. [?]).

Decomposable combining rules imply that for each node $x \in N$ there exists at
most one clause $c$ and a substitution $\theta$ s.t.
$body(c\theta) \subseteq LH(B)$ and $head(c\theta) = x$. Thus, while the same
clause $c$ can induce more than one node in $N$, all of these nodes have
identical local structure: the associated pdfs (and so the parameters) have to
be identical, i.e. $\forall$ subst. $\theta$: $cpd(c\theta) = cpd(c)$. As an
example consider the clause defining h(X) and the nodes h(uta), h(peter) and
h(john). This is the same situation as for dynamic Bayesian networks where the
parameters that encode the stochastic model of state evolution appear many times
in the network.

⁵ In the algorithm, this requires an additional step. We have to make sure that
(1) each $cpd(c)$ maps into $[0, 1]$, and (2) for each $u \in dom(head(c))$ and
for each $\mathbf{u} \in dom(body(c))$:
$\int_{-\infty}^{+\infty} cpd(c)(u, \mathbf{u}) \, du = 1$. This can be done by
renormalizing $\nabla_\lambda$ to the constrained surface before taking a step
in the direction of $\nabla_\lambda$.

⁶ In the case of more general combining rules the partial derivatives of an
inner function have to be computed. This may be difficult or even not possible
(in closed form).

112 Kristian Kersting and Luc De Raedt

In the following we will adapt a solution based on the chain rule of
differentiation given in [2] for dynamic Bayesian networks. For simplicity, we
fix the current instantiation of the parameters $\lambda$ and, hence, we write
$B$ and $P_N(D)$. Applying the chain rule on (4) yields



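$$\frac{\partial \log P_N(D)}{\partial \lambda(c)_t} = \sum_{\substack{\text{subst. } \theta \text{ with} \\ support(c\theta)}} \frac{\partial \log P_N(D)}{\partial \lambda(c\theta)_t} \qquad (5)$$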



where $\theta$ refers to grounding substitutions and $support(c\theta)$ is true
iff $\{head(c\theta)\} \cup body(c\theta) \subseteq N$. Assuming that the data
cases $D_l \in D$ are independently sampled from the same distribution, we can
separate the contribution of the different data cases to the partial derivative
of a single ground instance $c\theta$, resulting in:



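$$\frac{\partial \log P_N(D)}{\partial \lambda(c\theta)_t} = \sum_{l=1}^{m} \frac{\partial \log P_N(D_l)}{\partial \lambda(c\theta)_t} = \sum_{l=1}^{m} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \frac{p_N(head(c\theta) = u, body(c\theta) = \mathbf{u} \mid D_l)}{cpd(c\theta)(u, \mathbf{u})} \cdot \frac{\partial \, cpd(c\theta)(u, \mathbf{u})}{\partial \lambda(c\theta)_t} \, du \, d\mathbf{u} \qquad (6)$$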



where $u \in dom(head(c\theta))$, $\mathbf{u} \in dom(body(c\theta))$. The term
$p_N(u, \mathbf{u} \mid D_l)$ cannot be exactly calculated for all kinds of
distributions. If not, stochastic simulation techniques such as Markov chain
Monte Carlo methods (see e.g. [3, Appendix B]) should help. Another often used
solution is to restrict the types of the random variables. Most continuous
Bayesian networks have used Gaussian distributions for the density functions
(e.g. conditional Gaussian distributions [?]). This can be done with Bayesian
logic programs, too, so that the solution of the integrand in Equation (6) has a
closed form which can be adapted from [?], and an inference engine for Bayesian
logic programs can be used to get an exact solution. But still, in general, the
integrals in Equation (6) are intractable. Here again, stochastic simulation
algorithms may solve the problem. We finally would like to state the equations
of the partial derivatives for pure discrete programs. Doing the same steps as
before (Equations (5), (6)) and noting that the densities are now distributions
parameterized by their entries yields:



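$$\frac{\partial \log P_N(D)}{\partial \, cpd(c)_{jk}} = \sum_{\substack{\text{subst. } \theta \text{ with} \\ support(c\theta)}} \sum_{l=1}^{m} \frac{P_N(head(c\theta) = u_j, body(c\theta) = \mathbf{u}_k \mid D_l)}{cpd(c\theta)_{jk}} \qquad (7)$$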

where $u_j \in dom(head(c))$, $\mathbf{u}_k \in dom(body(c))$ and $j, k$ refer
to the corresponding entries in $cpd(c)$ and $cpd(c\theta)$. With this, it is
not difficult to adapt the equations for hybrid Bayesian logic programs. A
simplified skeleton of a gradient-based algorithm is shown in Table 1.

⁷ Due to space restrictions we leave the derivation of the equation out. It is
basically the derivation of equation (10) in [2] adapted to our notation.

Table 1. A simplified skeleton of the algorithm for adaptive Bayesian logic programs.

    function Basic-ABLP(B, D) returns a modified Bayesian logic program
      inputs: B, a Bayesian logic program; associated pdfs are parameterized by λ
              D, a finite set of data cases

      λ ← InitialParameters
      N ← SupportNetwork(B, D)
      repeat until Δλ ≈ 0
        Δλ ← 0
        set pdfs of N according to λ
        for each D_l ∈ D
          set the evidence in N from D_l
          for each clause c ∈ B
            for each ground instance cθ s.t. {head(cθ)} ∪ body(cθ) ⊆ N
              for each single parameter λ(cθ)_t
                Δλ(c)_t ← Δλ(c)_t + (∂ log P_N(D_l) / ∂λ(cθ)_t)
        Δλ ← ProjectionOntoConstraintSurface(Δλ)
        λ ← λ + α · Δλ
      return B
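The loop structure of the skeleton can be illustrated on a deliberately tiny case. The sketch below assumes fully observed data, a single clause whose ground instances all share one Gaussian density with known variance, and the mean as the only parameter, so no inference and no projection step are needed. The names `grad_log_gaussian_mean` and `fit_shared_mean` and the default values are hypothetical; this is not the authors' implementation.

```python
def grad_log_gaussian_mean(x, mean, variance):
    """d/d(mean) of log N(x; mean, variance)."""
    return (x - mean) / variance

def fit_shared_mean(data_cases, variance=20.0, step=1.0, tol=1e-9, max_iter=10_000):
    """Gradient ascent on the log-likelihood of a mean shared by all ground atoms."""
    mean = 0.0                                   # initial parameter
    for _ in range(max_iter):
        delta = 0.0
        for case in data_cases:                  # for each data case
            for value in case.values():          # for each ground instance
                # every ground instance contributes to the same tied parameter
                delta += grad_log_gaussian_mean(value, mean, variance)
        if abs(delta) < tol:                     # repeat until the gradient vanishes
            break
        mean += step * delta                     # small step along the gradient
    return mean

# Observed heights taken from the example data cases of Section 4.
data_cases = [{"h(uta)": 165.98, "h(peter)": 175.8}, {"h(uta)": 167.9}]
estimate = fit_shared_mean(data_cases)
```

In this special case the fixed point is the sample mean of all observed values, which gives a simple check of the summation over ground instances and data cases.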

6 Related Work 

To some extent, Bayesian logic programs are related to the BUGS language [8] 
which aims at carrying out Bayesian inference using Gibbs sampling. It uses 
concepts of imperative programming languages such as for-loops to model reg- 
ularities in probabilistic models. Therefore, the relation between Bayesian logic 
programs and BUGS is akin to the general relation between logical and impera- 
tive languages. This holds in particular for relational domains such as that used 
in this paper: family relationships. Without the notion of objects and relations
among objects, family trees are hard to represent. Furthermore, a single BUGS
program specifies a probability density over a finite set of random variables, 
whereas a Bayesian logic program can represent a distribution over an infinite 
set of random variables. 

There is work on parameter estimation within “first order probabilistic
logics” which do not rely on Bayesian networks. Cussens [1] investigates EM
methods to estimate the parameters of stochastic logic programs [?]. Sato et al.
[?] have shown that there is an efficient method for EM learning of PRISM
programs.

Learning within Bayesian networks is well investigated in the Uncertainty in
AI community, see e.g. [?]. Binder et al. [2], whose approach we have adapted,
present results for a gradient-based method. But so far, there has not been much
work on ML parameter estimation within first order extensions of Bayesian
networks. Koller and Pfeffer [?] adapt the EM algorithm for probabilistic logic
programs [20], a framework which in contrast to Bayesian logic programs sees
ground atoms as states of random variables. Although the framework seems to
theoretically allow for continuous random variables, there exists no (practical)
query-answering procedure for this case; to the best of our knowledge, Ngo and
Haddawy [20] give only a procedure for variables having finite domains.
Furthermore, Koller and Pfeffer’s approach utilizes support networks, too, but
requires the intersection of the support networks of the data cases to be empty.
In our opinion, this could be too restrictive in some cases, e.g. in the case of
dynamic Bayesian networks. However, if the data cases fulfill the property then
inference is faster. Friedman et al. [2] concentrate on learning within
probabilistic relational models, a framework that combines entity/relationship
models with Bayesian networks. They adapt the EM algorithm to estimate the
parameters and to learn the structure based on techniques known from Bayesian
network learning. So, they consider a more general problem setting than we do,
but on the other hand the entity/relationship model lacks the concept of
functors. Thus, they are limited to finite sets of entities and relations. For a
more detailed discussion of the relations of Bayesian logic programs to other
first order extensions of Bayesian networks such as probabilistic logic programs
[20], relational Bayesian networks [?] and probabilistic relational models [?]
we refer to [?].

Therefore, the related work on first order extensions of Bayesian networks 
mainly differs in two points from ours: (1) The underlying (logical) frameworks 
lack important knowledge representational features which Bayesian logic pro- 
grams have. (2) They adapt the EM algorithm which is particularly easy to 
implement. However, there are problematic issues both regarding speed of con- 
vergence as well as convergence towards a local (sub-optimal) maximum of the 
likelihood function. Different accelerations based on the gradient are discussed
in [?]. Also, the EM algorithm is difficult to apply in the case of general
probability density functions because it relies on computing the sufficient
statistics (cf. [9]).

7 Experimental Prospects 

The experimental results of [?] and [2] can be summarized as follows: (1) the
support network is a good approximation of the entire likelihood, (2) equality
constraints over parameters speed up learning, and (3) gradient-based methods
are promising. Therefore, we provide a proof of principle for our approach by
testing the hill-climbing algorithm on a simple model of our genetic domain.
We generated 100 data cases from a version of the program in Figure [?] where
the genetic information expressed by g is omitted. It describes the family tree
of 12 persons. The associated probability functions are
$cpd(m(M, X))(true) = cpd(f(F, X))(true) = 1.0$,
$cpd(h(X)) = \mathcal{N}(165, 20)$ and the one in Table 2, where
$\mathcal{N}(165, 20)$ denotes a normal density with mean 165 and variance 20.
The learning task was to estimate $a$ in the function of Table 2 and the mean
$b$ of $cpd(h(X))$ starting with $a = 165.0$ and $b = 0.0$. After 13 iterations
the estimated parameters were $a = -0.1856$ and $b = 164.7919$ using a step-size
of 1.0. The implementation obviously suffers from the well-known dependency on
the chosen initial parameters and fixed step-size. In the future we will
investigate more advanced gradient-based methods such as conjugate gradient. We
also conducted experiments with learning the weights of the sum in the function
of Table 2, i.e. 0.5 and 0.5. Here, the algorithm converges to weights almost
summing to 1.0 which are local minima w.r.t. our data generating model and the
likelihood.

    m(M,X)   f(F,X)   cpd(c)(h(X) | h(M), h(F))
    true     true     N(0.0 + 0.5 · h(M) + 0.5 · h(F), 20)
    true     false    N(165, 20)
    false    true     N(165, 20)
    false    false    N(165, 20)

Table 2. The probability density function used in our experiments. Only in the case
true, true are the heights of the parents taken into account as a weighted sum. The
constant addend in the mean of the first normal density is denoted as a, i.e. a = 0.0.

8 Conclusions 

We made two contributions. First, we have introduced continuous Bayesian logic 
programs. We have argued that the basic query-answering procedure for discrete 
programs is still applicable: The ability to represent both intensional regularities 
between the variables as well as continuous random variables reduces the size of 
many modelled domains. Second, we have addressed the question “where do the 
numbers come from?” by showing how to compute the gradient of the likelihood 
based on ideas known for (dynamic) Bayesian networks. The intensional repre- 
sentation of Bayesian logic programs, i.e. their compact representation should 
speed up learning and provide good generalization. 

In the future, we will perform a detailed comparison of our learning approach
with the EM algorithm. Accelerations of the EM algorithm based on the gradient
are interesting. Our ultimate goal is learning the structure. We are currently
(see [?]) looking for combinations of techniques known from Inductive Logic
Programming, such as refinement operators, with techniques like scoring
functions of Bayesian networks. Just as ML parameter estimation is a basic
technique for structural learning of Bayesian networks, it seems to be a basic
technique for structural learning of Bayesian logic programs.
Abstract. Recently, new representation languages that integrate first
order logic with Bayesian networks have been developed. Bayesian logic
programs are one of these languages. In this paper, we present results
on combining Inductive Logic Programming (ILP) with Bayesian networks
to learn both the qualitative and the quantitative components of
Bayesian logic programs. More precisely, we show how to combine the
ILP setting learning from interpretations with score-based techniques for
learning Bayesian networks. Thus, the paper positively answers Koller
and Pfeffer’s question, whether techniques from ILP could help to learn
the logical component of first order probabilistic models.



1 Introduction 

In recent years, there has been an increasing interest in integrating probability
theory with first order logic. One of the research streams [24,22,11,16,14] aims
at integrating two powerful and popular knowledge representation frameworks:
Bayesian networks [22] and first order logic. In 1997, Koller and Pfeffer [16]
addressed the question “where do the numbers come from?” for such frameworks.
At the end of the same paper, they raise the question whether techniques from
inductive logic programming (ILP) could help to learn the logical component
of first order probabilistic models. In [15] we suggested that the ILP setting
learning from interpretations [?] is a good candidate for investigating this
question. With this paper we would like to make our suggestions more concrete.
We present a novel scheme to learn intensional clauses within Bayesian logic
programs [?]. It combines techniques from ILP with techniques for learning
Bayesian networks. More exactly, we will show that the learning from
interpretations setting for ILP can be integrated with score-based Bayesian
network learning techniques for learning Bayesian logic programs. Thus, we
positively answer Koller and Pfeffer’s question.

We proceed as follows. After briefly reviewing the framework of Bayesian
logic programs in Section 2, we discuss our learning approach in Section 3. We
define the learning problem, introduce the scheme of the algorithm, and discuss
its application to a special class of propositional Bayesian logic programs,
well known under the name Bayesian networks, and to general Bayesian logic
programs. Before concluding the paper, we relate our approach to other work in
Section 5. We assume some familiarity with logic programming or Prolog (see
e.g. [?]) as well as with Bayesian networks (see e.g. [?]).

2 Bayesian Logic Programs 

Throughout the paper we will use an example from genetics which is inspired
by Friedman et al. [?]: “it is a genetic model of the inheritance of a single gene
that determines a person’s X blood type bt(X). Each person X has two copies
of the chromosome containing this gene, one, mc(Y), inherited from her mother
m(Y,X), and one, pc(Z), inherited from her father f(Z,X).” We will use the bold
letter $\mathbf{P}$ to denote a probability distribution, e.g. $\mathbf{P}(x)$,
and the normal letter $P$ to denote a probability value, e.g. $P(x = v)$, where
$v$ is a state of $x$.

The Bayesian logic program framework we will use in this paper is based on 
the Datalog subset of definite clausal logic, i.e. no functor symbols are allowed. 
The idea is that each Bayesian logic program specifies a Bayesian network, with 
one node for each (Bayesian) ground atom (see below). For a more expressive 
framework based on pure Prolog we refer to [14].

A Bayesian logic program $B$ consists of two components, firstly a logical one, a
set of Bayesian clauses (cf. below), and secondly a quantitative one, a set of
conditional probability distributions and combining rules (cf. below)
corresponding to that logical structure. A Bayesian (definite) clause $c$ is an
expression of the form

    A | A_1, ..., A_n

where $n \geq 0$, the $A, A_1, \ldots, A_n$ are Bayesian atoms and all Bayesian
atoms are (implicitly) universally quantified. We define $head(c) = A$ and
$body(c) = \{A_1, \ldots, A_n\}$. So, the differences between a Bayesian clause
and a logical one are: (1) the atoms $p(t_1, \ldots, t_n)$ and predicates $p$
arising are Bayesian, which means that they have an associated (finite) domain¹
$dom(p)$, and (2) we use “|” instead of “:-”. For instance, consider the
Bayesian clause $c$

    bt(X) | mc(X), pc(X).

where $dom(bt) = \{a, b, ab, 0\}$ and $dom(mc) = dom(pc) = \{a, b, 0\}$. It says
that the blood type of a person X depends on the inherited genetic information
of X. Note that the domain $dom(p)$ has nothing to do with the notion of a domain
in the logical sense. The domain $dom(p)$ defines the states of random variables.
Intuitively, a Bayesian predicate $p$ generically represents a set of (finite)
random variables. More precisely, each Bayesian ground atom $g$ over $p$
represents a (finite) random variable over the states $dom(g) := dom(p)$. E.g.
bt(ann) represents the blood type of a person named Ann as a random variable
over the states $\{a, b, ab, 0\}$. Apart from that, most other logical notions
carry over to Bayesian logic programs. So, we will speak of Bayesian predicates,
terms, constants, substitutions, ground Bayesian clauses, Bayesian Herbrand
interpretations etc. We will assume that all Bayesian clauses are
range-restricted. A clause is range-restricted iff all variables occurring in
the head also occur in the body. Range restriction is often imposed in the
database literature; it allows one to avoid derivation of non-ground true facts.

¹ For the sake of simplicity we consider finite random variables, i.e. random
variables having a finite set dom of states. However, the ideas generalize to
discrete and continuous random variables.
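Range restriction is a purely syntactic property and is easy to check. The sketch below assumes clauses are given as atom strings in Prolog-like syntax and that variables are exactly the arguments starting with an upper-case letter; the helper names `variables` and `is_range_restricted` are hypothetical.

```python
import re

def variables(atom):
    """Return the variables of an atom such as 'm(Y,X)'.
    Assumption: an argument is a variable iff it starts with an upper-case letter."""
    args = re.search(r"\((.*)\)", atom)
    if not args:
        return set()
    return {a.strip() for a in args.group(1).split(",") if a.strip()[:1].isupper()}

def is_range_restricted(head, body):
    """True iff every variable of the head also occurs in some body atom."""
    body_vars = set()
    for atom in body:
        body_vars |= variables(atom)
    return variables(head) <= body_vars

ok = is_range_restricted("mc(X)", ["m(Y,X)", "mc(Y)", "pc(Y)"])
bad = is_range_restricted("bt(X)", ["mc(Y)"])
```

A ground fact such as pc(ann) with an empty body passes the check trivially, since its head contains no variables.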

In order to represent a probabilistic model we associate with each Bayesian clause
$c$ a conditional probability distribution $cpd(c)$ encoding
$\mathbf{P}(head(c) \mid body(c))$. To keep the exposition simple, we will assume
that $cpd(c)$ is represented as a table, see Figure 1. More elaborate
representations like decision trees or rules are also possible. The distribution
$cpd(c)$ generically represents the conditional probability distributions of all
ground instances $c\theta$ of the clause $c$. In general, one may have many
clauses, e.g. clauses $c_1$ and $c_2$

    bt(X) | mc(X).
    bt(X) | pc(X).

and corresponding substitutions $\theta_i$ that ground the clauses $c_i$ such
that $head(c_1\theta_1) = head(c_2\theta_2)$. They specify $cpd(c_1\theta_1)$ and
$cpd(c_2\theta_2)$, but not the distribution required:
$\mathbf{P}(head(c_1\theta_1) \mid body(c_1) \cup body(c_2))$. The standard
solution to obtain the distribution required are so-called combining rules:
functions which map finite sets of conditional probability distributions
$\{\mathbf{P}(A \mid A_{i1}, \ldots, A_{in_i}) \mid i = 1, \ldots, m\}$ onto one
(combined) conditional probability distribution
$\mathbf{P}(A \mid B_1, \ldots, B_k)$ with
$\{B_1, \ldots, B_k\} \subseteq \bigcup_{i=1}^{m} \{A_{i1}, \ldots, A_{in_i}\}$.
We assume that for each Bayesian predicate $p$ there is a corresponding
combining rule $cr$, such as noisy_or.
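As an illustration of a combining rule, the following sketch implements the usual noisy-or for a Boolean head atom. It assumes each ground clause contributes one value P(A = true | body_i) for the current state of its body and that the clauses act as independent causes; the function name `noisy_or` is chosen for this sketch only.

```python
def noisy_or(p_true_values):
    """Combine P(A = true | body_i) from several ground clauses with the same head.
    A is false only if every clause independently fails to make it true."""
    p_false = 1.0
    for p_true in p_true_values:
        p_false *= 1.0 - p_true
    return 1.0 - p_false

combined = noisy_or([0.8, 0.5])   # 1 - 0.2 * 0.5
```

With a single clause the rule returns that clause's value unchanged, so it behaves like the identity combining rule in that case.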

To summarize, a Bayesian logic program $B$ consists of a (finite) set of
Bayesian clauses. To each Bayesian clause $c$ there is exactly one conditional
probability distribution $cpd(c)$ associated, and for each Bayesian predicate $p$
there is exactly one associated combining rule $cr(p)$.

(1)
    m(ann,dorothy).
    f(brian,dorothy).
    pc(ann).
    pc(brian).
    mc(ann).
    mc(brian).
    mc(X) | m(Y,X), mc(Y), pc(Y).
    pc(X) | f(Y,X), mc(Y), pc(Y).
    bt(X) | mc(X), pc(X).

(2)
    mc(X)   pc(X)   P(bt(X))
    a       a       (0.97, 0.01, 0.01, 0.01)
    b       a       (0.01, 0.01, 0.97, 0.01)
    ...     ...     ...
    0       0       (0.01, 0.01, 0.01, 0.97)

Fig. 1. (1) The Bayesian logic program bloodtype encoding our genetic domain. To
each Bayesian predicate, the identity is associated as combining rule. (2) A
conditional probability distribution associated to the Bayesian clause
bt(X) | mc(X), pc(X) represented as a table.

The declarative semantics of Bayesian logic programs is given by the annotated
dependency graph. The dependency graph $DG(B)$ is that directed graph whose nodes
correspond to the ground atoms in the least Herbrand model $LH(B)$ (cf. below).
It encodes the directly influenced by relation over the random variables in
$LH(B)$: there is an edge from a node $x$ to a node $y$ if and only if there
exists a clause $c \in B$ and a substitution $\theta$, s.t. $y = head(c\theta)$,
$x \in body(c\theta)$ and for all atoms $z$ appearing in $c\theta$:
$z \in LH(B)$. The direct predecessors of a graph node $x$ are denoted as its
parents, $Pa(x)$. The Herbrand base $HB(B)$ is the set of all random variables we
could talk about. It is defined as if $B$ were a logic program (cf. [18]). The
least Herbrand model $LH(B) \subseteq HB(B)$ consists of all relevant random
variables, the random variables over which a probability distribution is defined
by $B$, as we will see. Again, $LH(B)$ is defined as if $B$ were a logic program
(cf. [18]). It is the least fixpoint of the immediate consequence operator
applied on the empty interpretation. Therefore, a ground atom which is true in
the logical sense corresponds to a relevant random variable. Now, to each node
$x$ in $DG(B)$ the combined conditional probability distribution is associated
which is the result of the combining rule $cr(p)$ of the corresponding Bayesian
predicate $p$ applied on the set of $cpd(c\theta)$'s where $head(c\theta) = x$
and $\{x\} \cup body(c\theta) \subseteq LH(B)$. Thus, if $DG(B)$ is acyclic and
not empty then it would encode a Bayesian network, because Datalog programs have
a finite least Herbrand model which always exists and is unique. Therefore, the
following independency assumption holds: each node $x$ is independent of its
non-descendants given a joint state of its parents $Pa(x)$ in the dependency
graph. E.g. the program in Figure 1 renders bt(dorothy) independent from
pc(brian) given a joint state of pc(dorothy), mc(dorothy). Using this assumption
the following proposition is provable:

Proposition 1. Let $B$ be a Bayesian logic program. If $B$ fulfills that

1. $LH(B) \neq \emptyset$ and

2. $DG(B)$ is acyclic

then it specifies a unique probability distribution $\mathbf{P}_B$ over $LH(B)$.

To see this, remember that if the conditions are fulfilled then $DG(B)$ is a
Bayesian network. Thus, given a total order $x_1, \ldots, x_n$ of the nodes in
$DG(B)$ the distribution $\mathbf{P}_B$ factorizes in the usual way:
$\mathbf{P}_B(x_1, \ldots, x_n) = \prod_{i=1}^{n} \mathbf{P}(x_i \mid Pa(x_i))$,
where $\mathbf{P}(x_i \mid Pa(x_i))$ is the combined conditional probability
distribution associated to $x_i$. A program $B$ fulfilling the conditions is
called well-defined, and we will consider such programs for the rest of the
paper. The program bloodtype in Figure 1 encodes the regularities in our genetic
example. Its grounded version, which is a Bayesian network, is given in
Figure 2. This illustrates that Bayesian networks [?] are well-defined
propositional Bayesian logic programs. Each node-parents pair uniquely specifies
a propositional Bayesian clause; we associate the identity as combining rule to
each predicate; the conditional probability distributions are the ones of the
Bayesian network.
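The factorization can be evaluated directly for a complete assignment. The sketch below uses a three-node fragment around bt(ann); the row for mc = a, pc = a is taken from Figure 1, whereas the prior distributions for mc(ann) and pc(ann) and the function name `joint_probability` are assumptions made only for this illustration.

```python
def joint_probability(assignment, parents, cpd):
    """Product of P(x_i | Pa(x_i)) over all nodes of a complete assignment."""
    prob = 1.0
    for node, value in assignment.items():
        parent_state = tuple(assignment[p] for p in parents[node])
        prob *= cpd[node][parent_state][value]
    return prob

parents = {"mc(ann)": (), "pc(ann)": (), "bt(ann)": ("mc(ann)", "pc(ann)")}
cpd = {
    "mc(ann)": {(): {"a": 0.5, "b": 0.3, "0": 0.2}},   # assumed prior
    "pc(ann)": {(): {"a": 0.5, "b": 0.3, "0": 0.2}},   # assumed prior
    # row (a, a) of the table in Figure 1, states ordered a, b, ab, 0
    "bt(ann)": {("a", "a"): {"a": 0.97, "b": 0.01, "ab": 0.01, "0": 0.01}},
}
p = joint_probability(
    {"mc(ann)": "a", "pc(ann)": "a", "bt(ann)": "a"}, parents, cpd
)
```

The result is the product 0.5 · 0.5 · 0.97 of the three local probabilities.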
    m(ann,dorothy).
    f(brian,dorothy).
    pc(ann).
    pc(brian).
    mc(ann).
    mc(brian).
    mc(dorothy) | m(ann,dorothy), mc(ann), pc(ann).
    pc(dorothy) | f(brian,dorothy), mc(brian), pc(brian).
    bt(ann) | mc(ann), pc(ann).
    bt(brian) | mc(brian), pc(brian).
    bt(dorothy) | mc(dorothy), pc(dorothy).

Fig. 2. The grounded version of the Bayesian logic program of Figure 1. It (directly)
encodes a Bayesian network.
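The set of ground atoms in Figure 2 can be reproduced by iterating the immediate consequence operator from the empty interpretation. The following is a naive bottom-up sketch for Datalog clauses; the tuple encoding of atoms and the helper names `match`, `ground_bodies` and `least_herbrand_model` are assumptions of this sketch.

```python
def match(pattern, fact, subst):
    """Extend subst so that pattern equals fact, or return None."""
    (p_pred, p_args), (f_pred, f_args) = pattern, fact
    if p_pred != f_pred or len(p_args) != len(f_args):
        return None
    subst = dict(subst)
    for p, f in zip(p_args, f_args):
        if p[0].isupper():                 # variable (upper-case convention)
            if subst.setdefault(p, f) != f:
                return None
        elif p != f:                       # constant mismatch
            return None
    return subst

def ground_bodies(body, facts, subst):
    """Yield every substitution that maps all body atoms into facts."""
    if not body:
        yield subst
        return
    for fact in facts:
        s = match(body[0], fact, subst)
        if s is not None:
            yield from ground_bodies(body[1:], facts, s)

def least_herbrand_model(clauses):
    """Apply the immediate consequence operator until a fixpoint is reached."""
    facts = set()
    while True:
        new = set()
        for (pred, args), body in clauses:
            for s in ground_bodies(body, facts, {}):
                new.add((pred, tuple(s.get(a, a) for a in args)))
        if new <= facts:
            return facts
        facts |= new

# The program bloodtype of Figure 1 (logical part only).
bloodtype = [
    (("m", ("ann", "dorothy")), []),
    (("f", ("brian", "dorothy")), []),
    (("pc", ("ann",)), []), (("pc", ("brian",)), []),
    (("mc", ("ann",)), []), (("mc", ("brian",)), []),
    (("mc", ("X",)), [("m", ("Y", "X")), ("mc", ("Y",)), ("pc", ("Y",))]),
    (("pc", ("X",)), [("f", ("Y", "X")), ("mc", ("Y",)), ("pc", ("Y",))]),
    (("bt", ("X",)), [("mc", ("X",)), ("pc", ("X",))]),
]
model = least_herbrand_model(bloodtype)
```

The resulting model contains the eleven ground atoms that appear as clause heads in Figure 2.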



3 Structural Learning of Bayesian Logic Programs 

Let us now focus on the logical structure of Bayesian logic programs. When
designing Bayesian logic programs, the expert has to determine this (logical)
structure of the Bayesian logic program by specifying the extensional and
intensional predicates, and by providing definitions for each of the intensional
predicates. Given this logical structure, the Bayesian logic program induces (the
structure of) a Bayesian network whose nodes are the relevant random variables.
It is well known that determining the structure of a Bayesian network, and
therefore also of a Bayesian logic program, can be difficult and expensive. On
the other hand, it is often easier to obtain a set $D = \{D_1, \ldots, D_m\}$ of
data cases. A data case $D_i \in D$ has two parts, a logical and a probabilistic
part.

The logical part of a data case $D_i \in D$, denoted as $Var(D_i)$, is a Herbrand
interpretation. Consider e.g. the least Herbrand model LH(bloodtype) (cf.
Figure 2) and the logical atoms LH(bloodtype) in the following case:

    {m(cecily,fred), f(henry,fred), pc(cecily), pc(henry), pc(fred),
     mc(cecily), mc(henry), mc(fred), bt(cecily), bt(henry), bt(fred)}

These (logical) interpretations can be seen as the least Herbrand models of
unknown Bayesian logic programs. They specify different sets of relevant random
variables, depending on the given “extensional context”. If we accept that the
genetic laws are the same for both families then a learning algorithm should
transform such extensionally defined predicates into intensionally defined ones,
thus compressing the interpretations. This is precisely what ILP techniques are
doing. The key assumption underlying any inductive technique is that the rules
that are valid in one interpretation are likely to hold for any interpretation.

² In a sense, relevant random variables are those variables which Cowell et al.
[?, p. 25] mean when they say that the first phase in developing a Bayesian
network involves to “specify the set of ‘relevant’ random variables”.
It thus seems clear that techniques for learning from interpretations can be
adapted for learning the logical structure of Bayesian logic programs. Learning
from interpretations is an instance of the non-monotonic learning setting of ILP
(cf. [?]), which uses only positive examples (i.e. models).

So far, we have specified the logical part of the learning problem: we are
looking for a set $H$ of Bayesian clauses given a set $D$ of data cases s.t.
$\forall D_i \in D : LH(H \cup Var(D_i)) = Var(D_i)$, i.e. the Herbrand
interpretation $Var(D_i)$ is a model for $H$. The hypotheses $H$ in the space
$\mathcal{H}$ of hypotheses are sets of Bayesian clauses. However, we have to be
more careful. A candidate set $H \in \mathcal{H}$ has to be acyclic on the data,
which means that for each $D_i \in D$ the induced Bayesian network over
$LH(H \cup Var(D_i))$ has to be acyclic. Let us now focus on the quantitative
components. The quantitative component of a Bayesian logic program is given by
the associated conditional probability distributions and combining rules. We
assume that the combining rules are fixed. Each data case $D_i \in D$ has a
probabilistic part which is a partial assignment of states to the random
variables in $Var(D_i)$. We say that $D_i$ is a partially observed joint state
of $Var(D_i)$. As an example consider the following two data cases:

    {m(cecily,fred) = true, f(henry,fred) = ?, pc(cecily) = a, pc(henry) = b, pc(fred) = ?,
     mc(cecily) = b, mc(henry) = b, mc(fred) = ?, bt(cecily) = ab, bt(henry) = b, bt(fred) = ?}

    {m(ann,dorothy) = true, f(brian,dorothy) = true, pc(ann) = b,
     mc(ann) = ?, mc(brian) = a, mc(dorothy) = a, pc(dorothy) = a,
     pc(brian) = ?, bt(ann) = ab, bt(brian) = ?, bt(dorothy) = a},

where ? denotes an unknown state of a random variable. The partial assignments
induce a joint distribution over the random variables of the logical parts. A
candidate $H \in \mathcal{H}$ should reflect this distribution. In Bayesian
networks the conditional probability distributions are typically learned using
gradient descent or EM for a fixed structure of the Bayesian network. A scoring
mechanism that evaluates how well a given structure $H \in \mathcal{H}$ matches
the data is maximized. Therefore, we will assume a function
$score_D : \mathcal{H} \to \mathbb{R}$.

To summarize, the learning problem can be formulated as follows:

Given a set $D = \{D_1, \ldots, D_m\}$ of data cases, a set $\mathcal{H}$ of
Bayesian logic programs and a scoring function
$score_D : \mathcal{H} \to \mathbb{R}$.

Find a candidate $H^* \in \mathcal{H}$ which is acyclic on the data such that for
all $D_i \in D$: $LH(H^* \cup Var(D_i)) = Var(D_i)$, and $H^*$ matches the data
$D$ best according to $score_D$.

The best match in this context refers to those parameters of the associated
conditional probability distributions which maximize the scoring function. For a
discussion on how the best match can be computed see [?] or [?]. The chosen
scoring function is a crucial aspect of the algorithm. Normally, we can only
hope to find a sub-optimal candidate. A heuristic learning algorithm solving
this problem is given in Algorithm 1.






                               P(A | A_1, ..., A_n)
A_1     A_2     ...   A_n        true      false
true    true    ...   true       1.0       0.0
false   true    ...   true       0.0       1.0
...
false   false   ...   false      0.0       1.0

Table 1. The conditional probability distribution associated to a Bayesian clause
A | A_1, ..., A_n encoding a logical one.



Background knowledge can be incorporated in our approach in the following
way. The background knowledge can be expressed as a fixed Bayesian logic
program B. Then we search for a candidate H* which is, together with B, acyclic
on the data such that for all D_i ∈ D: LH(B ∪ H* ∪ Var(D_i)) = Var(D_i), and
B ∪ H* matches the data D best according to score_D. In [14], we show how pure
Prolog programs can be represented as Bayesian logic programs w.r.t. the
conditions 1 and 2 of Proposition 1. The basic idea is as follows. Assume that a
logical clause A :- A_1, ..., A_n is given. We encode the clause by the Bayesian clause
A | A_1, ..., A_n where A, A_1, ..., A_n are now Bayesian atoms over {true, false}.
We associate to the Bayesian clause the conditional probability distribution of
Table 1 and set the combining rule of A's predicate to max:

$$\max\{P(A \mid A_{i1}, \ldots, A_{in_i}) \mid i = 1, \ldots, n\} = P\Big(A \,\Big|\, \bigcup_{i=1}^{n} \{A_{i1}, \ldots, A_{in_i}\}\Big) := \max_{i=1}^{n} \{P(A \mid A_{i1}, \ldots, A_{in_i})\}.$$
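The encoding can be illustrated with a minimal Python sketch. It assumes Boolean atoms and represents each ground clause for A only by the truth values of its body atoms; the function names `logical_cpd`, `max_combine` and `prob_head_true` are hypothetical and not part of the authors' implementation.

```python
from itertools import product

def logical_cpd(n):
    """Deterministic CPD of Table 1 for a clause with n body atoms:
    P(A = true | body) is 1.0 only if every body atom is true."""
    return {body: (1.0 if all(body) else 0.0)
            for body in product([True, False], repeat=n)}

def max_combine(probabilities):
    """The 'max' combining rule: keep the largest P(A = true | body_i)."""
    return max(probabilities)

def prob_head_true(clause_bodies):
    """clause_bodies holds one tuple of truth values per ground clause
    whose head is A; the clause-wise CPDs are combined with max."""
    return max_combine(logical_cpd(len(body))[tuple(body)]
                       for body in clause_bodies)

# A is derivable as soon as one clause body is completely true.
print(prob_head_true([(True, False), (True, True)]))
print(prob_head_true([(False,), (True, False)]))
```

With these deterministic tables, max behaves like a logical disjunction over the clauses for A, which is the behaviour of the original logic program.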

We will now explain Algorithm 1 and its underlying ideas in more detail.
The next section illustrates the algorithm for a special class of Bayesian logic
programs: Bayesian networks. For Bayesian networks, the algorithm coincides with
score-based methods for learning within Bayesian networks, which have proven
useful in the UAI community (see e.g. [?]). Therefore, an extension to the
first order case seems reasonable. It will turn out that the algorithm works for
first order Bayesian logic programs, too.
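The control flow of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the refinement operators, the validity test, the acyclicity test and the scoring function are passed in as ordinary functions, and the toy hypotheses (sets of integers) merely stand in for Bayesian logic programs.

```python
def greedy_structure_search(initial, refine, is_valid, is_acyclic, score):
    """Greedy search in the style of Algorithm 1 (all arguments are
    caller-supplied stand-ins for the paper's components)."""
    best, best_score = initial, score(initial)
    while True:
        current, previous_score = best, best_score   # H' := H; S(H') := S(H)
        for candidate in refine(current):            # rho_g(H') union rho_s(H')
            if is_valid(candidate) and is_acyclic(candidate):
                s = score(candidate)
                if s > best_score:
                    best, best_score = candidate, s
        if best_score == previous_score:             # until S(H') = S(H)
            return best, best_score

# Toy stand-ins: a hypothesis is a frozenset of integers, a refinement
# toggles one element, and the score counts mismatches with a hidden target.
TARGET = frozenset({0, 2})

def toggle_one(h):
    return [h ^ frozenset({x}) for x in range(4)]

result = greedy_structure_search(
    frozenset(),
    refine=toggle_one,
    is_valid=lambda h: True,
    is_acyclic=lambda h: 3 not in h,   # pretend element 3 creates a cycle
    score=lambda h: -len(h ^ TARGET),
)
print(result)
```

As in Algorithm 1, all neighbours are generated from the hypothesis held at the start of a pass, and the loop stops when a full pass brings no improvement in score.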

3.1 A Propositional Case: Bayesian Networks 

Here we will show that Algorithm 1 is a generalization of score-based techniques
for structural learning within Bayesian networks. To do so we briefly review these
score-based techniques. Let x = {x_1, ..., x_n} be a fixed set of random variables.
The set x corresponds to a least Herbrand model of an unknown propositional
Bayesian logic program representing a Bayesian network. The probabilistic
dependencies among the relevant random variables are not known, i.e. the
propositional Bayesian clauses are unknown. Therefore, we have to select such a
propositional Bayesian logic program as a candidate and estimate its parameters. The
data cases of the data D = {D_1, ..., D_m} look like

{m(ann, dorothy) = true, f(brian, dorothy) = true, pc(ann) = a,
mc(ann) = ?, mc(brian) = ?, mc(dorothy) = a, mc(dorothy) = a,
pc(brian) = b, bt(ann) = a, bt(brian) = ?, bt(dorothy) = a}

which is a data case for the Bayesian network in Figure 2. Note that the atoms
have to be interpreted as propositions. The set of candidate Bayesian logic
programs spans the hypothesis space ℋ. Each H ∈ ℋ is a Bayesian logic program
consisting of n propositional clauses: for each x_i ∈ x a single clause c with
head(c) = x_i and body(c) ⊆ x. To traverse ℋ we specify two refinement
operators ρ_g : ℋ → 2^ℋ and ρ_s : ℋ → 2^ℋ that take a candidate and modify it
to produce a set of candidates. The search algorithm performs informed search
in ℋ based on score_D. In the case of Bayesian networks the operator ρ_g(H)
deletes a Bayesian proposition from the body of a Bayesian clause c_i ∈ H, and
the operator ρ_s(H) adds a Bayesian proposition to the body of c_i ∈ H.
Typical scores are, e.g., the minimum description length score [?] or the
Bayesian scores [?].

Let H be an initial (valid) hypothesis;
S(H) := score_D(H);
repeat
    H' := H;
    S(H') := S(H);
    foreach H'' ∈ ρ_g(H') ∪ ρ_s(H') do
        if H'' is (logically) valid on D then
            if the Bayesian networks induced by H'' on the data are acyclic then
                if score_D(H'') > S(H) then
                    H := H'';
                    S(H) := S(H'');
                end
            end
        end
    end
until S(H') = S(H);
Return H;

Algorithm 1. A greedy algorithm for searching the structure of Bayesian logic programs.

Fig. 3. (1) The use of refinement operators during structural search for Bayesian
networks. We can add (ρ_s) a proposition to the body of a clause or delete (ρ_g) it
from the body. (2) The use of refinement operators during structural search within
the framework of Bayesian logic programs. We can add (ρ_s) a constant-free atom to
the body of a clause or delete (ρ_g) it from the body. Candidates crossed out in (1)
and (2) are illegal because they are cyclic.

As a simple illustration we consider a greedy hill-climbing algorithm
incorporating score_D(H) := LL(D, H), the log-likelihood of the data D given a
candidate structure H with the best parameters. We pick an initial candidate S ∈ ℋ
as starting point (e.g. the set of all propositions) and compute the likelihood
LL(D, S) with the best parameters. Then, we use ρ(S) to compute the legal
“neighbours” (candidates being acyclic) of S in ℋ and score them. All neighbours
are valid (see below for a definition of validity). E.g. replacing pc(dorothy) with
pc(dorothy) | pc(brian) gives such a “neighbour”. We take the S' ∈ ρ(S)
with the best improvement in score. The process is continued until no
improvement in score is obtained. The use of the two refinement operators is
illustrated in Figure 3.
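The two propositional refinement operators can be sketched in Python. The representation is an assumption made for illustration: a candidate is a dictionary mapping each proposition (clause head) to the frozenset of propositions in its clause body, and the names `rho_s` and `rho_g` simply mirror the operators ρ_s and ρ_g.

```python
def rho_s(h, variables):
    """Specialise: add one proposition to the body of one clause."""
    neighbours = []
    for head in sorted(h):
        for v in sorted(variables):
            if v != head and v not in h[head]:
                new = dict(h)
                new[head] = h[head] | {v}
                neighbours.append(new)
    return neighbours

def rho_g(h):
    """Generalise: delete one proposition from the body of one clause."""
    neighbours = []
    for head in sorted(h):
        for v in sorted(h[head]):
            new = dict(h)
            new[head] = h[head] - {v}
            neighbours.append(new)
    return neighbours

# The candidate "a | b.  b | c." with an empty body for c.
h = {"a": frozenset({"b"}), "b": frozenset({"c"}), "c": frozenset()}
for neighbour in rho_g(h) + rho_s(h, {"a", "b", "c"}):
    print(neighbour)
```

Neither operator checks acyclicity: a neighbour such as "b | c, a" together with "a | b" is generated here and has to be discarded by a separate cycle test.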

3.2 The First Order Case 

Here, we will explain the ideas underlying our algorithm in the first order case.
On the logical level it is similar to the ILP setting of learning from interpretations,
which is used e.g. in the CLAUDIEN system [?]: (1) all data cases are
interpretations, and (2) a hypothesis should reflect what is in the data. The
first point is carried over by requiring each data case D_i ∈ {D_1, ..., D_m} to
be a partially observed joint state of a Herbrand interpretation of an unknown
Bayesian logic program. This also implies that all data cases are probabilistically
independent*. The second point is enforced by requiring all hypotheses to be
(logically) true in all data cases, i.e. the logical structure of the hypothesis is
certain. Thus, the logical rules valid on the data cases are constraints on the
space of hypotheses. The main difference to the pure logical setting is that we
have to take the probabilistic parts of the data cases into account.

Definition 1 (Characteristic induction from interpretations). (Adapted
w.r.t. our purposes from [?].) Let D be a set of data cases and C the set of all
clauses that can be part of a hypothesis. H ⊆ C is a logical solution iff H is a
logically maximally general valid hypothesis. A hypothesis H ⊆ C is (logically)
valid iff for all D_i ∈ D: H is (logically) true in D_i. A hypothesis H ⊆ C is
a probabilistic solution iff H is a valid hypothesis and the Bayesian network
induced by H on D is acyclic.

* The independence of data cases is an assumption which one has to verify when
using our method. In the case of families the assumption seems reasonable.

It is common to impose syntactic restrictions on the space ℋ = 2^C of hypotheses
through the language L, which determines the set C of clauses that can be part
of a hypothesis. The language L is an important parameter of the induction task.

Language Assumption. In this paper, we assume that the alphabet of L only
contains constant and predicate symbols that occur in one of the data cases,
and we restrict C to range-restricted, constant-free clauses containing at most
k = 3 atoms in the body. Furthermore, we assume that the combining rules
associated to the Bayesian predicates are given.

Let us discuss some properties of our setting. (1) Using partially observed
joint states of interpretations as data cases is the first order equivalent of what is
done in Bayesian network learning. There each data case is described by means
of a partially observed joint state of a fixed, finite set of random variables.
Furthermore, it implicitly corresponds to assuming that all relevant ground atoms
of each data case are known: all random variables not stated in the data case
are regarded as not relevant (false in the logical sense). (2) Hypotheses have
to be valid. Intuitively, validity means that the hypothesis holds (logically) on
the data, i.e. that the induced hypothesis postulates true regularities present
in the data cases. Validity is a monotone property at the level of clauses, i.e.
if H_1 and H_2 are valid with respect to a set of data cases D, then H_1 ∪ H_2 is
valid. This means that all well-formed clauses in C can (logically) be
considered completely independently of each other. Both arguments (1) and (2) together
guarantee that no possible dependence among the random variables is lost. (3)
The condition of maximal generality appears in the definition because the most
interesting hypotheses in the logical case are the most informative and hence
the most general. Therefore, we will use a logical solution as initial hypothesis.
But the best scored hypothesis need not be maximally general, as the initial
hypothesis in the next example shows. Here, our approach differs from the pure
logical setting. We consider probabilistic solutions instead of logical solutions.
The idea is to incorporate a scoring function known from learning of Bayesian
networks to evaluate how well the given probabilistic solution matches the data.

The key to our proposed algorithm is the well-known definition of logical
entailment (cf. [?]). It induces a partial order on the set of hypotheses. To compute
our initial (valid) hypotheses we use the CLAUDIEN algorithm. Roughly
speaking, CLAUDIEN works as follows (for a detailed discussion we refer to [?]). It
keeps track of a list of candidate clauses Q, which is initialized to the maximally
general clause (in L). It repeatedly deletes a clause c from Q, and tests whether
c is valid on the data. If it is, c is added to the final hypothesis; otherwise, all
maximally general specializations of c (in L) are computed (using a so-called
refinement operator ρ, see below) and added back to Q. This process continues
until Q is empty and all relevant parts of the search space have been considered.
We now have to define operators to traverse ℋ. A logical specialization (or
generalization) of a set H of Bayesian clauses could be achieved by specializing (or
generalizing) single clauses c ∈ H. In our approach we use the two refinement
operators ρ_s : ℋ → 2^ℋ and ρ_g : ℋ → 2^ℋ. The operator ρ_s(H) adds
constant-free atoms to the body of a single clause c ∈ H, and ρ_g(H) deletes constant-free
atoms from the body of a single clause c ∈ H. Figure 3 shows the different
refinement operators for the general first order case and the propositional case for
learning Bayesian networks. Instead of adding (deleting) propositions to (from)
the body of a clause, they add (delete), according to our language assumption,
constant-free atoms. Furthermore, Figure 3 shows that using the refinement
operators each probabilistic solution can be reached.

As a simple instantiation of Algorithm 1 we consider a greedy hill-climbing
algorithm incorporating score_D(H) := LL(D, H). It picks a (logical) solution
S ∈ ℋ as starting point and computes LL(D, S) with the best parameters. For a
discussion of how these parameters can be found we refer to [?]. E.g. having
data cases over LH(bloodtype) and LH(bloodtype), we choose as initial candidate

mc(X) | m(Y, X).

pc(X) | f(Y, X).

bt(X) | mc(X).

It is likely that the initial candidate is not a probabilistic solution, although it is
a logical solution. E.g. in the candidate above the blood type does not depend on
the paternal genetic information. Then, we use ρ_s(S) and ρ_g(S) to compute the
legal “neighbours” of S in ℋ and score them. E.g. one such “neighbour” is given by
replacing bt(X) | mc(X) with bt(X) | mc(X), pc(X). Let S' be the valid and acyclic
neighbour which is scored best. If LL(D, S) < LL(D, S'), then we take S' as new
hypothesis. The process is continued until no improvement in score is obtained.
During the search we have to take care to prune away every hypothesis H which
is invalid or leads to cyclic dependency graphs (on the data cases). This can be
tested in time O(s · r_i^3) where r_i is the number of random variables of the largest
data case in D and s is the number of clauses in H. To do so, we build the
Bayesian networks induced by H over each Var(D_i) by computing the ground
instances for each clause c ∈ H where the ground atoms are members of Var(D_i).
This takes O(s · r_i^3). Then, we test in O(r_i) for a topological order of the nodes
in the induced Bayesian network.
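The final step, testing for a topological order, can be sketched with Kahn's algorithm. The sketch assumes the induced ground network is already available as a dictionary mapping each ground atom to the set of its parents; the grounding step itself is not shown, and the function name `is_acyclic` is hypothetical.

```python
from collections import deque

def is_acyclic(parents):
    """Return True iff the graph given by `parents` (node -> set of parent
    nodes) admits a topological order, using Kahn's algorithm."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    indegree = {n: len(parents.get(n, ())) for n in nodes}
    children = {n: [] for n in nodes}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    # Every node is visited exactly when no cycle blocks the ordering.
    return visited == len(nodes)

legal = {"bt(fred)": {"mc(fred)", "pc(fred)"},
         "mc(fred)": {"m(cecily,fred)"},
         "pc(fred)": {"f(henry,fred)"}}
cyclic = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
print(is_acyclic(legal), is_acyclic(cyclic))
```

The test runs in time linear in the number of nodes and edges of the ground network, so the grounding step dominates the overall cost.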

4 Preliminary Experiments 

We have implemented the algorithm in Sicstus Prolog 3.8.1. The implementation
has an interface to Matlab to score hypotheses using the BNT toolbox [?].
We considered two totally independent families, using the predicates given by
bloodtype, having 12 and 15 family members, respectively. For each least Herbrand
model 1000 samples from the induced Bayesian network were gathered.

The general question was whether we could learn the intensional rules of
bloodtype. Therefore, we first had a look at the (logical) hypothesis space. The
space can be seen as the first order equivalent of the space for learning the
structure of Bayesian networks (see Figure 3). In a further experiment the goal
was to learn a definition for the predicate bt. We fixed the definitions for
the other predicates in two ways: (1) to the definitions the CLAUDIEN
system had computed, and (2) to the definitions from the bloodtype Bayesian logic
program. In both cases, the algorithm scored bt(X) | mc(X), pc(X) best, i.e.
the algorithm re-discovered the intensional definition which was originally
used to build the data cases. Furthermore, the result shows that the best scored
solution was independent of the fixed definitions. This could indicate that ideas
about decomposable scoring functions can or should be lifted to the first
order case. Although these experiments are preliminary, they suggest that ILP
techniques can be adapted for structural learning within first order probabilistic
frameworks.



5 Related Work 

To the best of our knowledge, there has not been much work on learning within
first order extensions of Bayesian networks. Koller and Pfeffer [16] show how to
estimate the maximum likelihood parameters for Ngo and Haddawy's
framework of probabilistic logic programs [?] by adapting the EM algorithm. Kersting
and De Raedt [?] discuss a gradient-based method to solve the same problem
for Bayesian logic programs. Friedman et al. [?] tackle the problem of learning
the logical structure of first order probabilistic models. They use
Structural-EM for learning probabilistic relational models. This algorithm is similar to the
standard EM method except that during iterations of this algorithm the
structure is improved. As far as we understand this approach, it does not consider logical
constraints on the space of hypotheses in the way our approach does. Therefore,
we suggest that both ideas can be combined. There also exist methods for
learning within first order probabilistic frameworks which do not build on Bayesian
networks. Sato et al. [?] give a method for EM learning of PRISM programs.
They do not incorporate ILP techniques. Cussens [3] investigates EM-like
methods for estimating the parameters of stochastic logic programs. Within the same
framework, Muggleton [?] uses ILP techniques to learn the logical structure.
The ILP setting used there differs from learning from interpretations and seems not
to be based on learning of Bayesian networks.

Finally, Bayesian logic programs are somewhat related to the BUGS
language [?]. The BUGS language is based on imperative programming. It uses
concepts such as for-loops to model regularities in probabilistic models. So, the
differences between Bayesian logic programs and BUGS are akin to the
differences between declarative programming languages (such as Prolog) and
imperative ones. Therefore, adapting techniques from Inductive Logic Programming
to learn the structure of BUGS programs does not seem to be easy.

6 Conclusions 

A new link between ILP and learning within Bayesian networks is presented. We
have proposed a scheme for learning the structure of Bayesian logic programs.
It builds on the ILP setting of learning from interpretations. We have argued that
by adapting this setting, score-based methods for structural learning of Bayesian
networks can be lifted to the first order case. The ILP setting is used to
define and traverse the space of (logical) hypotheses. Instead of the score-based greedy
algorithm, other UAI methods such as Structural-EM may be used. The preliminary
experiments are promising: they suggest that our approach works. But the link
established between ILP and Bayesian networks seems to be bi-directional. Can
ideas developed in the UAI community be carried over to ILP?

The research within the UAI community has shown that score-based
methods are useful. In order to see whether this still holds for the first-order case we
will perform more detailed experiments. Experiments on real-world scale
problems will be conducted. We will look for more elaborate scoring functions,
e.g. scores based on the minimum description length principle. We will
investigate more difficult tasks like learning multi-clause definitions. The use of
refinement operators adding or deleting non constant-free atoms should be
explored. Furthermore, it would be interesting to weaken the assumption that a
data case corresponds to a complete interpretation. Not assuming that all relevant
random variables are known would be interesting for learning intensional rules
like nat(s(X)) | nat(X). Lifting the idea of decomposable scoring functions to
the first order case should result in a speed-up of the algorithm. In this sense,
we believe that the proposed approach is a good point of departure for further
research.

Abstract. This paper tackles the problem that methods for
propositionalization and feature construction in first-order logic to date construct
features in a rather unspecific way. That is, they do not construct features
“on demand”, but rather in advance and without detecting the need for
a representation change. Even if structural features are required, current
methods do not construct these features in a goal-directed fashion.

In previous work, we presented a method that creates structural
features in a class-sensitive manner: We queried the molecular feature
miner (MolFea) for features (linear molecular fragments) with a
minimum frequency in the positive examples and a maximum frequency in
the negative examples, such that they are statistically significantly
over-represented in the positives and under-represented in the negatives. In
the present paper, we go one step further. We construct structural
features in order to discriminate between those examples from different
classes that are particularly problematic to classify. In order to avoid
overfitting, this is done in a boosting framework. We alternate
AdaBoost re-weighting episodes and feature construction episodes in
order to construct structural features “on demand”. In a feature
construction episode, we query for features with a minimum cumulative
weight in the positives and a maximum cumulative weight in the
negatives, where the weights stem from the previous AdaBoost iteration.
In summary, we propose to construct structural features “on demand”
by a combination of AdaBoost and an extension of MolFea to handle
weighted learning instances.



1 Introduction 

In the past few years, both machine learning and inductive logic programming
have devoted a lot of attention to feature construction [21,18,4,3,15]. Features
are constructed either in a class-blind [?] or in a class-sensitive manner [?].
However, current methods still construct features in a rather unspecific way.
That is, they do not construct features “on demand”, but rather in advance
and without detecting the need for a representation change. Even if structural
features are required, current methods usually do not construct these features in
a goal-directed fashion. In this paper, we investigate the possibility of a
“demand-driven” construction of structural features.

Changing the representation on demand in ILP is not new. The system
MODELER [?] formed new concepts if the number of exceptions to a rule
became implausibly large. De Raedt [?] proposed to shift the bias to a more
expressive representation language if the current language is not sufficient to
express the concept. The present work is also related to other approaches that shift
the bias to increasingly expressive representation languages, but do not focus on
the demand-driven aspect [2,1]. The work described in this paper differs in
several respects. Firstly, the expressiveness of the representation language is not
changed. Secondly, the learning setting is that of inductive concept learning,
where we assume noise in the data. Thirdly, we deal with the demand-driven
construction of structural features.

The idea of demand-driven feature construction in ILP is also related to 
the idea of using ILP methods only where propositional learning fails, or, more 
precisely, of using an ILP algorithm in order to correct the mistakes made by a 
propositional learning algorithm. One submission by Srinivasan and co-workers 
to the PTE-2 challenge [22] based on this idea turned out to be optimal under 
certain cost functions and/or class distributions according to ROC analysis. 

In previous work, we introduced the Molecular Feature Miner (MolFea)
[?], a domain specific inductive database that is capable of searching for
linear molecular fragments (corresponding to features) of interest in large databases
of chemical compounds. For instance, one can query the system for features
(linear molecular fragments) with a minimum frequency in the positive examples
and a maximum frequency in the negative examples, such that they are
statistically significantly over-represented in the positives and under-represented in
the negatives. Recent evidence with MolFea [?] showed that such a
class-sensitive feature construction can be beneficial in conjunction with classical
Machine Learning systems. In the present paper, we go one step further. We
construct structural features in order to discriminate between those examples
from different classes that are particularly problematic to classify. In order to
avoid overfitting, this is done in a boosting framework. In a feature
construction episode, we query for features with a minimum cumulative weight in
the positives and a maximum cumulative weight in the negatives. This requires
an extension of MolFea to handle not only frequencies of instances, but also
weights.

This paper is organized as follows. Sections 2 and 3 introduce the molecular 
feature miner and its extension for handling weighted instances. In Section 4, we 
present our approach to the demand-driven construction of structural features. 
After the section on experimental results, we conclude the paper. 



2 The Molecular Feature Miner MolFea 

In this section, we briefly review the Molecular Feature Miner (MolFea).
MolFea is a domain specific inductive database which aims at mining molecular
fragments (features) of interest in chemical data. The level-wise version space
algorithm (cf. [?]) forms its basis. More information can be found in [?].
Inductive databases follow Mannila and Toivonen's [?] formulation of the
general pattern discovery task. Given a database r, a language L for expressing
patterns, and a constraint q, find the theory of r with respect to L and q, i.e.
Th(L, r, q) = {φ ∈ L | q(r, φ) is true}. Viewed in this way, Th(L, r, q) contains
all sentences within the pattern language considered that make the constraint q
true.

Fig. 1. Example compound in a 2-D representation. 'cl - c ~ c ~ c ~ c - o' is an
example fragment occurring in the molecule.

2.1 Molecular Fragments 

A molecular fragment is defined as a sequence of linearly connected atoms. For
instance, 'o - s - c' is a fragment meaning: “an oxygen atom with a single bond
to a sulfur atom with a single bond to a carbon atom”. In such expressions 'c',
'n', 'cl', etc. denote elements, and '-' denotes a single bond, '=' a double bond,
'#' a triple bond, and '~' an aromatic bond. As is common in the literature, we
only consider “heavy” (i.e., non-hydrogen) atoms in this paper.

We assume that the system is given a database of example compounds and 
that each of the example compounds in the database is described using a 2-D 
representation. The information given there consists of the elements of the atoms 
of a molecule and the bond orders (single, double, triple, aromatic). An example 
compound in such a representation is shown in Fig. 1. 

A molecular fragment f covers an example compound e if and only if f
considered as a graph is a subgraph of example e. For instance, fragment
'cl - c ~ c ~ c ~ c - o' covers the example compound in Fig. 1.

There are a number of interesting properties of the language of molecular
fragments M:

- fragments in M are partially ordered by the is more general than relation;
  when fragment g is more general than fragment s we will write g ≤ s;

- within this partial order, two syntactically different fragments are equivalent
  only when they are a reversal of one another; e.g. 'c - o - s' and 's - o - c'
  denote the same substructure;

- g ≤ s if and only if g is a subsequence of s or g is a subsequence of the
  reversal of s; e.g. 'c - o' ≤ 'c - o - s'.
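The generality test on fragments can be sketched in a few lines of Python. Two assumptions are made for illustration: fragments are written as space-separated tokens alternating atoms and bonds, and "subsequence" is read as a contiguous run of atoms and bonds, since fragments are linearly connected. The function names are hypothetical and not part of MolFea.

```python
def parse(fragment):
    """'c - o - s' -> ('c', '-', 'o', '-', 's')"""
    return tuple(fragment.split())

def occurs_in(g, s):
    """True if token tuple g occurs contiguously in s, starting at an atom
    (atoms sit at even positions, bonds at odd positions)."""
    return any(s[i:i + len(g)] == g
               for i in range(0, len(s) - len(g) + 1, 2))

def more_general(g, s):
    """g <= s: g occurs in s or in the reversal of s."""
    g, s = parse(g), parse(s)
    return occurs_in(g, s) or occurs_in(g, s[::-1])

def equivalent(f1, f2):
    """Two fragments are equivalent if each is more general than the other."""
    return more_general(f1, f2) and more_general(f2, f1)

print(more_general("c - o", "c - o - s"))
print(more_general("o - c", "c - o - s"))
print(equivalent("c - o - s", "s - o - c"))
print(more_general("c - s", "c - o - s"))
```

The last call returns False because 'c' and 's' are not directly bonded in 'c - o - s'; under a non-contiguous reading of "subsequence" the result would differ.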

Note that the representation of molecular fragments is relatively restricted
compared to some other representations employed in data mining, such as
first-order queries [?] or subgraphs [?]. Although fragments are a relatively restricted
representation of chemical structure, it is easy for trained chemists to recognize
the functional group(s) that a given fragment occurs in. Thus, the interpretation
of a fragment reveals more than meets the eye.

2.2 Constraints on Fragments 

The features that will be constructed can be declaratively specified using a
conjunction of primitive constraints c_1 ∧ ... ∧ c_n. The primitive constraints c_i that
can be imposed on the unknown target fragments f are:

- f ≤ p, p ≤ f, ¬(f ≤ p) and ¬(p ≤ f): where f is the unknown target
  fragment and p is a specific pattern; this type of primitive constraint denotes
  that f should (not) be more general (specific) than the specified fragment p;
  e.g. the constraint 'c - o' ≤ f specifies that f should be more specific than
  'c - o', i.e. that f should contain 'c - o' as a subsequence;

- freq(f, D) denotes the relative frequency of a fragment f on a set of
  molecules D; the relative frequency of a fragment f w.r.t. a dataset D is
  defined as the percentage of molecules in D that f covers;

- freq(f, D_1) ≤ t, freq(f, D_2) ≥ t where t is a positive real number and
  D_1 and D_2 are sets of molecules; these constraints denote that the relative
  frequency of f on the dataset D_1 (resp. D_2) should be smaller than (resp.
  larger than) or equal to t; e.g. the constraint freq(f, Pos) ≥ 0.95 denotes that
  the target fragments f should have a minimum relative frequency of 95 % on
  the set of molecules Pos.

If weights are associated with instances, we can generalize frequency-related 
constraints to weight-related constraints: 

- sum_weights(f, D) denotes the sum of the weights of the instances in D
  covered by a fragment f;

- sum_weights(f, D_1) ≤ t, sum_weights(f, D_2) ≥ t where t is a positive real
  number and D_1 and D_2 are sets of molecules; these constraints denote that
  the sum of the weights of those examples in D_1 (resp. D_2) covered by f
  should be smaller than (resp. larger than) or equal to t.

These primitive constraints can now conjunctively be combined in order to
declaratively specify the target fragments of interest. Note that the conjunction
may specify constraints w.r.t. any number of datasets, e.g. imposing a minimum
frequency on a set of active molecules, and a maximum one on a set of inactive
ones. E.g. the following constraint:

('c - o' ≤ f) ∧ ¬(f ≤ 'c - o - s - c - o - s') ∧
(freq(f, Act) ≥ 0.95) ∧ (freq(f, Inact) ≤ 0.05)

queries for all fragments that include the sequence 'c - o', are not a subsequence
of 'c - o - s - c - o - s', have a frequency on Act that is at least 95 percent
and a frequency on Inact that is at most 5 percent.
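The example query can be checked against a candidate fragment with a short Python sketch. Several simplifications are assumptions made for illustration only: molecules are themselves linear token strings (real compounds are graphs and coverage is a subgraph test), "subsequence" is read as a contiguous run, frequencies are fractions in [0, 1], and all function names and the toy data are hypothetical.

```python
def tokens(text):
    return tuple(text.split())

def contains(small, big):
    """True if `small` occurs contiguously in `big` or in its reversal."""
    def occurs(a, b):
        return any(b[i:i + len(a)] == a
                   for i in range(0, len(b) - len(a) + 1, 2))
    return occurs(small, big) or occurs(small, big[::-1])

def freq(fragment, molecules):
    """Fraction of molecules covered by the fragment."""
    covered = sum(contains(tokens(fragment), tokens(m)) for m in molecules)
    return covered / len(molecules)

def sum_weights(fragment, weighted_molecules):
    """Sum of the weights of the (molecule, weight) pairs that are covered."""
    return sum(w for m, w in weighted_molecules
               if contains(tokens(fragment), tokens(m)))

def satisfies_query(f, act, inact):
    """The conjunctive example constraint from the text."""
    return (contains(tokens("c - o"), tokens(f))
            and not contains(tokens(f), tokens("c - o - s - c - o - s"))
            and freq(f, act) >= 0.95
            and freq(f, inact) <= 0.05)

act = ["n - c - o - c", "c - o - c - n"]      # toy "active" molecules
inact = ["c - s - c", "n = c"]                # toy "inactive" molecules
print(satisfies_query("c - o - c", act, inact))
print(satisfies_query("c - o", act, inact))
print(sum_weights("c - o", [("c - o - c", 0.25), ("c - s", 0.75)]))
```

The fragment 'c - o' fails the query because it is itself a subsequence of 'c - o - s - c - o - s', while 'c - o - c' passes all four primitive constraints on the toy data.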



3 Solving Constraints 

In this section, we show that the solution space sol(c_1 ∧ ... ∧ c_n) in M for
a conjunctive constraint c_1 ∧ ... ∧ c_n is a version space and can therefore be
represented by its borders.

Firstly, because the primitive constraints c_i are independent of one
another, it follows that

$$sol(c_1 \wedge \ldots \wedge c_n) = sol(c_1) \cap \ldots \cap sol(c_n)$$

So, we can find the overall solutions by taking the intersection of the primitive
ones.

Secondly, each of the primitive constraints c is monotonic or anti-monotonic
w.r.t. generality (cf. [18]). A constraint c is monotonic (resp. anti-monotonic)
w.r.t. generality whenever

$$\forall s, g \in \mathcal{M} : (g \le s) \wedge (g \in sol(c)) \rightarrow (s \in sol(c))$$

(resp. (g ≤ s) ∧ (s ∈ sol(c)) → (g ∈ sol(c))). The basic anti-monotonic constraints in
our framework are: (f ≤ p), (freq(f, D) ≥ m), (sum_weights(f, D) ≥ m); the
basic monotonic ones are (p ≤ f), (freq(f, D) ≤ m), (sum_weights(f, D) ≤ m).
Furthermore, the negation of a monotonic constraint is anti-monotonic and vice
versa.

Monotonic and anti-monotonic constraints are important because their
solution space is bounded by a border. This fact is well-known in both the data
mining literature (cf. [18]), where the borders are often denoted by BD^+, and
the machine learning literature (cf. [19]), where the symbols S and G are
typically used.

To define borders, we need the notions of minimal and maximal elements of 
a set w.r.t. generality. Let F be a set of fragments, then define 

min{F) = {/ g F | ~^3q G F : f < q} 
max{F) = {/ G F I G F :g</}0 

^ Note that min gives the minimally general elements, and max the maximally general 
ones. In contrast, g < s means that g is more general than s. 
We can now define the borders $S(c)$ and $G(c)$ of a primitive constraint $c$ as

$G(c) = max(sol(c))$ and $S(c) = min(sol(c))$

Anti-monotonic constraints $c$ will have $G(c) = \{\top\}$ and, for proper
constraints, $S(c) \neq \{\bot\}$; proper monotonic constraints have
$S(c) = \{\bot\}$ and $G(c) \neq \{\top\}$.
Furthermore, as in Mitchell's version space framework, we have that

$sol(c) = \{ f \in M \mid \exists s \in S(c), \exists g \in G(c) : g \le f \le s \}$

This last property implies that $S(c)$ (resp. $G(c)$) are proper borders for
anti-monotone (resp. monotone) constraints.
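
The border definitions can be illustrated with a small sketch. It is not the MolFea implementation: fragments are replaced by plain strings, generality by the substring relation, and the database, the universe of candidate fragments and the frequency threshold are all invented for illustration.

```python
def more_general(g, s):
    """g <= s: here, g is a substring of s (toy stand-in for fragment generality)."""
    return g in s

def min_general(F):
    """min(F): elements of F that are not strictly more general than another element."""
    return {f for f in F if not any(f != q and more_general(f, q) for q in F)}

def max_general(F):
    """max(F): elements of F for which no other element is strictly more general."""
    return {f for f in F if not any(g != f and more_general(g, f) for g in F)}

def from_borders(universe, S, G):
    """Rebuild the solution set from its borders: g <= f <= s for some g in G, s in S."""
    return {f for f in universe
            if any(more_general(f, s) for s in S)
            and any(more_general(g, f) for g in G)}

def freq(f, database):
    """Number of database entries in which fragment f occurs."""
    return sum(f in d for d in database)

# Hypothetical data: all substrings of "abc"; the empty string plays the role of top.
universe = {"", "a", "b", "c", "ab", "bc", "abc"}
database = ["abc", "ab", "bcd"]

# Anti-monotonic constraint freq(f, D) >= 2
sol = {f for f in universe if freq(f, database) >= 2}
S, G = min_general(sol), max_general(sol)
```

In this toy setting the G border collapses to the top element, as stated for anti-monotonic constraints, and the solution set is recovered exactly from the two borders.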

So, we have that the set of solutions $sol(c_i)$ to each primitive constraint is a
simple version space completely characterized by $S(c_i)$ and $G(c_i)$. Therefore, the
set of solutions $sol(c_1 \wedge \ldots \wedge c_n)$ to a conjunctive constraint
$c_1 \wedge \ldots \wedge c_n$ will also be completely characterized by the
corresponding $S(c_1 \wedge \ldots \wedge c_n)$ and $G(c_1 \wedge \ldots \wedge c_n)$.

Elsewhere |7I6| , we have presented algorithms for computing the S and G sets 
corresponding to the constraints. The algorithm basically integrates the levelwise 
algorithm by Mannila and Toivonen [T^ with Mellish’s description identification 
algorithm. In principle, one might also employ Hirsh’s version space merging 
algorithm HI]. 

4 Constructing Structural Features on Demand 

In the previous section, we summarized the basic ideas underlying our domain 
specific inductive database MolFea. Using MolFea, we can declaratively spec- 
ify the features of interest by a set of constraints. The space of all fragments (and 
corresponding to the fragments, the features) satisfying the constraints takes the 
form of a version space. 

In order to construct structural features on demand, we performed two pre- 
liminary experiments: in the first one, we performed iterations of feature con- 
struction, where we queried for features that discriminate two individual, mis- 
classified examples from different classes. In the second, we queried for features 
that discriminate between the false positives (resp. false negatives) and the pos- 
itives (resp. negatives), either in one or in several iterations. In all of these cases, 
there were indications of overfitting. Thus, we devised a novel approach in the 
framework of boosting. 

In the novel approach, we interleave AdaBoost m re-weighting episodes and 
feature construction episodes. Another view on the approach would be that we 
are boosting a weak learner that consists of weight-sensitive feature construction 
and some propositional learning algorithm. 

In Table 1, the pseudo-code of our novel approach is shown. As the pseudo-
code indicates, we perform regular AdaBoost iterations. In the first iteration, 

Footnote: At this point, we will follow Mitchell's terminology, because he works with
two dual borders (a set of maximally general solutions G and a set of maximally
specific ones S). In data mining, one typically only works with the S-set.
Table 1. Pseudo-code of AdaBoost, as instantiated in the presented approach. The
basic AdaBoost procedure repeatedly queries MolFea for features (fragments) that
are over-represented in the positive examples $E^+$ and under-represented in the
negative examples $E^-$. These features are used in a propositional learner. For
simplicity of presentation, we included two exit statements in the pseudocode.



procedure AdaBoost($E^+$, $E^-$, PropFeatures, MaxIt)
    All training examples in $E^+ \cup E^-$ are weighted by $w_i = 1/N$
    $j := 1$;
    repeat
        if $j = 1$
            then $Fs_j$ := PropFeatures
            else $Fs_j$ := MolFea($E^+$, $E^-$);
        if $Fs_j = \emptyset$ then exit
        $H_j$ := PropLearner($Fs_j$, $E^+$, $E^-$);
        $\epsilon_j$ := weighted error rate of $H_j$ on $E^+ \cup E^-$
        if $\epsilon_j = 0$ or $\epsilon_j > 1/2$ then exit
        $\beta_j$ := ...
        for each $e_i \in E^+ \cup E^-$ do
            if misclassifies($H_j$, $e_i$)
                then $w_i := w_i \cdot$ ...
        renormalize all weights $w_i$ to sum up to 1
    until $j$ = MaxIt
    return hypotheses $H_j$ weighted by $\beta_j$



we use the initial propositional features PropFeatures that are given as an in- 
put argument. In further iterations, MolFea attempts to construct structural 
features with a high cumulative weight in the positive examples and a low cumu- 
lative weight in the negatives. Thus, the feature construction algorithm (like the 
propositional learner applied subsequently) focuses attention on those examples 
that are difficult to classify. 
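
The interleaving of re-weighting and feature construction can be sketched as follows. This is only an illustration under stated assumptions, not the system described here: `construct_features` is a hypothetical stand-in for a MolFea query (it filters a fixed list of candidate Boolean features by cumulative weight), `learn_stump` is a hypothetical one-feature learner standing in for the propositional learner, and the weight update and vote weights follow a common AdaBoost.M1-style formulation, which is an assumption because the exact update is not legible in the pseudo-code.

```python
import math

def weighted_error(h, X, y, w):
    """Sum of the weights of the examples that hypothesis h misclassifies."""
    return sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)

def learn_stump(features, X, y, w):
    """Hypothetical propositional learner: best single-feature rule."""
    best = None
    for f in features:
        for sign in (1, -1):
            h = lambda x, f=f, sign=sign: sign if x[f] else -sign
            err = weighted_error(h, X, y, w)
            if best is None or err < best[0]:
                best = (err, h)
    return best[1]

def construct_features(candidates, X, y, w, min_pos, max_neg):
    """Stand-in for a MolFea query: keep candidates whose cumulative weight
    (multiplied by N) is high in the positives and low in the negatives."""
    n = len(X)
    selected = []
    for f in candidates:
        pos = n * sum(wi for xi, yi, wi in zip(X, y, w) if yi == 1 and xi[f])
        neg = n * sum(wi for xi, yi, wi in zip(X, y, w) if yi == -1 and xi[f])
        if pos >= min_pos and neg <= max_neg:
            selected.append(f)
    return selected

def adaboost(X, y, prop_features, candidates, max_it=10, min_pos=2, max_neg=1):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for j in range(1, max_it + 1):
        feats = (prop_features if j == 1
                 else construct_features(candidates, X, y, w, min_pos, max_neg))
        if not feats:                      # first exit: no features found
            break
        h = learn_stump(feats, X, y, w)
        err = weighted_error(h, X, y, w)
        if err == 0 or err >= 0.5:         # second exit (assumed >= 1/2)
            break
        factor = (1 - err) / err           # assumed update factor
        w = [wi * factor if h(xi) != yi else wi for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]       # renormalise to sum 1
        ensemble.append((math.log(factor), h))
    return ensemble, w

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Invented toy data: "p" is the initial propositional feature, "s1"/"s2" are
# candidate structural features; labels are +1 / -1.
X = [{"p": 1, "s1": 1, "s2": 0}, {"p": 1, "s1": 1, "s2": 0},
     {"p": 0, "s1": 1, "s2": 1}, {"p": 0, "s1": 1, "s2": 1},
     {"p": 0, "s1": 0, "s2": 0}, {"p": 1, "s1": 0, "s2": 0}]
y = [1, 1, 1, -1, -1, -1]
ensemble, weights = adaboost(X, y, ["p"], ["s1", "s2"])
```

On this toy data the loop stops in the third iteration because no candidate feature reaches the minimum cumulative weight any more, which mirrors the early stopping discussed for MolFea.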

The parameters for calling MolFea are determined as follows: We are seeking
features (fragments) that are over-represented in the positive examples $E^+$
and under-represented in the negative examples $E^-$. If we had a dataset with
a weight of one for each instance, we could do the following: We are interested
in fragments with a minimum frequency of, say, 6, 10, 15, 20, respectively, and
apply the $\chi^2$-test to a 2 x 2 contingency table with the class as one variable and
the occurrence of the fragment as the other one to determine the maximum
allowable frequency in the negative examples. If we multiply the AdaBoost weights
of the examples by $N$, the number of training examples, then we can proceed in
Table 2. Results of the new approach in standard ILP benchmark domains. It.
denotes the number of iterations, Acc. denotes the predictive accuracy of ten-fold
cross-validation if the maximum number of iterations was set to the resp. value;
# H.'s denotes the number of hypotheses from ten-fold cross-validation that are
constructed in the resp. iteration (note that the process can stop due to a number
of reasons); Av. C. denotes the average number of conditions in PART hypotheses of
the resp. iteration (also note that PART might just return the empty default theory).

            PTE                     Mutag.                  Biodeg.
It.   Acc.  # H.'s  Av. C.    Acc.  # H.'s  Av. C.    Acc.  # H.'s  Av. C.
 1    62.9    10      1.0     81.4    10      6.3     59.5    10      2.1
 2    62.9    10      7.9     79.8    10     14.8     67.7    10      9.0
 3    63.8    10      2.6     88.8    10     25.6     72.9    10      7.9
 4    63.8     3      0.3     85.6    10     22.9     73.8     9      4.9
 5    63.8     1      0.0     86.7    10     14.8     74.7     5      4.6
 6     -       -       -      88.3    10      9.9     74.7     3      0.3
 7     -       -       -      88.3     8      2.1     74.7     1      0.0
 8     -       -       -      88.8     5      3.6      -       -       -
 9     -       -       -      87.8     5      3.0      -       -       -
10     -       -       -      88.3     3      1.0      -       -       -


analogy to the above case in order to determine the values of the minimum and
maximum cumulative weights in the queries posed to MolFea.
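
As a worked illustration of this thresholding step, the following sketch derives a maximum allowable frequency in the negatives from a given minimum frequency in the positives. The function names, the class sizes and the 5% significance level (critical value 3.841 for one degree of freedom) are assumptions made for the example; the text does not state which significance level was used.

```python
def chi2_2x2(a, b, c, d):
    """Chi-square statistic of the 2 x 2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def max_negative_frequency(min_pos, n_pos, n_neg, critical=3.841):
    """Largest frequency in the negatives for which a fragment occurring in
    min_pos positives is still significantly over-represented in the positives.
    Returns None if no such frequency exists."""
    best = None
    for f_neg in range(n_neg + 1):
        over_represented = min_pos / n_pos > f_neg / n_neg
        stat = chi2_2x2(min_pos, n_pos - min_pos, f_neg, n_neg - f_neg)
        if over_represented and stat >= critical:
            best = f_neg
    return best

# Hypothetical class sizes: 100 positives, 100 negatives, minimum frequency 20.
threshold = max_negative_frequency(20, 100, 100)
```

With weighted examples, the same computation would be applied to cumulative weights multiplied by N instead of raw counts.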

5 Experimental Results 

The goal of the experimentation (as described in this section) was to show 

— that it is possible to improve upon the initial propositional representation 
by the construction of new structural features, and 

— that the approach is not prone to overfitting. 

We performed experiments in three real-world domains: carcinogenicity pre- 
diction |22] (the PTE-2 dataset), mutagenicity prediction [20] and biodegrad- 
ability prediction |H| . For biodegradation prediction, we used a two-class version 
(degradable or not) with a half-life time (HLT) threshold of 4 weeks. 

The initial propositional features were: the result of the Ames test for the 
PTE data, the LUMO and logP values for mutagenicity, and the molecular 
weight and the logP for biodegradability. The maximum number of iterations 
was set to 10. PART |2], one of the best rule learning systems available today, 
was chosen as the propositional learner in the AdaBoost iterations. The set- 
tings for feature construction were as described above: the minimum cumulative 
weight (multiplied by N) was set to 6, 10, 15 and 20, respectively. The maximum 
cumulative weight was set dependent on the class distribution (that is actually 
modified during AdaBoost iterations). 

Footnote: Note that the sum of weights of the training examples is always one in AdaBoost.M1.
Table 2 summarizes the results from 10-fold cross-validation: In all three
domains, it shows that the approach is able to improve upon the initial
representation. As the number of iterations increases, the generalization performance
increases as well, and peaks after a few iterations. Thus, we may conclude that
the approach indeed improves over the initial propositional representation and 
that it shows no sign of overfitting. Interestingly, the algorithm stops quite early 
in several cases, since MolFea cannot find any more statistically significant 
structural features in the data. 

Note that we did not make any attempts to tune the parameters: we just used 
AdaBoost.M1 in conjunction with PART as is (the default settings). Besides, as
mentioned above, our goal was to show that the overall approach improves upon 
the initial representation and that it does not overfit the training data. 



6 Conclusion 

In this paper we tackled the problem of a demand-driven construction of struc- 
tural features by a combination of weight-related queries posed to the molecular 
feature miner MolFea and the AdaBoost framework. Although the approach 
was presented in the context of molecular fragments, it is possible to adapt it 
to other pattern domains, such as, e.g., first-order Datalog queries. We believe 
that many extensions of this work are conceivable. Different pattern domains, 
different machine learning systems and different variants of boosting could be 
investigated in the future. 
Abstract. Given the very widespread use of multirelational databases,
ILP systems are increasingly being used on data originating from such 
warehouses. Unfortunately, even though not complex in structure, such 
business data often contain highly non-determinate components, making 
them difficult for ILP learners geared towards structurally complex tasks. 
In this paper, we build on popular transformation-based approaches to 
ILP and describe how they can naturally be extended with relational 
aggregation. We experimentally show that this results in a multirelational
learner that outperforms a structurally-oriented ILP system both
in speed and accuracy on this class of problems. 



1 Introduction 

Relational databases and data warehouses are arguably the most widespread and 
commonly used technology for storing information in business, industry, and ad- 
ministration. The increasing popularity of data warehouses in particular has 
highlighted the fact that there is a large reservoir of application problems in the 
business area which would benefit from multirelational analysis techniques. Busi- 
ness databases, however, present somewhat different challenges than those found 
in the classical show case areas of ILP, such as molecular biology or language 
learning. Whereas the latter often involve highly complex structural elements, 
perhaps requiring deep nesting and recursion, business databases are usually 
structurally simple and restricted to a function-free representation. Their chal- 
lenges are in two other directions. Firstly, it is quite normal that such databases 
are highly non-determinate — consider the case of a bank where a separate table 
stores thousands of transactions of a particular customer. Secondly, even though 
the number of involved relations may not be huge, the total size of these relations 
often will be, and scalability to perhaps millions of tuples is important. 

Given these differing characteristics, an interesting question is whether it 
would not be beneficial to construct ILP learning systems that are optimised for 
these kinds of applications, just as many of today’s state-of-the-art ILP systems 
are geared more towards structurally complex applications. In this paper, we 
show that indeed it is possible to construct a learning system that is well suited 
to such domains by building on approaches from the fields of ILP and the field 
of databases. 
In particular, to ensure scalability, we have adopted a transformation-based
approach where an ILP-problem is first transformed into a propositional problem, 
and then handled by a fast propositional learner. Whereas existing ILP learners 
based on transformation either handle only constrained or determinate clauses 
m, such as LINUS [7] or DINUS g], or use heuristically or bias-selected clauses 
as simple binary features [6,9], in our approach we fully treat non-determinate
clauses by employing the idea of aggregation from the area of databases, thus
allowing non-determinate relationships to be represented in summary features 
of the propositional representation. 

In an experimental evaluation on different learning problems arising from the 
multirelational data supplied by the ECML-1998 and the PKDD-1999/PKDD- 
2000 challenges, respectively, we show that indeed this approach outperforms 
both more restricted transformation-based learners, as well as a state-of-the-art 
ILP learning system geared more towards structurally complex domains. We 
compare the use of decision trees and support vector machines as propositional 
learners, and conclude that our approach reaches a good performance and fast 
runtimes with both of these. 

The paper is organised as follows. In Section 2, we give an introduction to 
transformation-based approaches to ILP. In Section 3, we provide details of our 
own feature construction method, discuss its declarative bias language which is 
based on foreign links, and show how it incorporates the aggregation operator. In 
Section 4, we give a detailed experimental evaluation of our method, comparing 
it both to simpler transformation-based learners and to a state-of-the-art non- 
transformation-based ILP system. Section 5 provides references to related work, 
and Section 6 concludes and gives pointers to future work. 



2 Transformation-Based Approaches 

As usual in ILP, in this paper we assume that we are given a set of positive
examples $E^+$, a set of negative examples $E^-$, and background knowledge $B$.
Since we are dealing with data originating in relational databases, we will assume
that $E^+$ is a set of ground $p$-atoms, i.e., atoms the predicate of which is the
target predicate $p$ (of arity $a$). Similarly, $E^-$ is a set of ground negated
$p$-atoms, and $B$ is a set of ground atoms using different background knowledge
predicates. The learning task can then be defined as follows.

— Given: $E^+$, $E^-$, $B$ as described above, such that
$E^+ \cup E^- \cup B \not\models \Box$ and $B \not\models E^+$

— Find: A hypothesis $h$ from a set of allowed hypotheses $\mathcal{L}_H$ such that
the error of $h$ on future instances is minimised.

In ILP, $h$ is usually a set of first-order clauses, and a new instance is classified
as positive if and only if it is covered by this set of clauses. In a
transformation-based approach to ILP, on the other hand, we assume we are given a
transformation function $\tau$ which transforms the given $E^+$, $E^-$, and $B$ into
a single propositional table. One then uses a propositional learner on this table,
producing a propositional hypothesis $h$ which can then be used to classify future
instances (which of course first need to be transformed by $\tau$ as well).

In principle, designers of transformation-based ILP systems are not restricted
to any particular form of $\tau$ functions. In practice, it is commonplace to base
the transformation on an implicit first-order hypothesis space $H$, and use the
literals and variable bindings of the clauses in $H$ to define the transformation. For
example, in the pioneering work on LINUS [7], a space of constrained clauses
was used, whereas in its successor system DINUS [8], a space of determinate
clauses [12] was used instead. As an alternative, if selected arbitrary clauses are
used, one can apply existential transformations and use the clauses as binary
features [6,9]. In order to better understand this framework, and to allow for an
easier description of our own work, we will now describe this process of defining
transformation functions in more detail.



2.1 Transformation Functions Based on Clauses

We will start by assuming that we are given a set $\mathcal{C}$ of clauses upon which
feature generation is to be based. Note that $\mathcal{C}$ can be a systematically defined
entire hypothesis space, but could also consist of a few selected clauses, so the 
following formalisation also covers the case of using individual clauses (perhaps 
learned by a non-transformation-based ILP learner) for feature generation as it 
is suggested in |2]. As a piece of notation, for a target predicate p of arity a, let 

$\top := p(X_1, \ldots, X_a)$ (1)

denote the most general $p$-atom. Since we are considering a single predicate
learning task, we can assume without loss of generality that all
$C \in \mathcal{C}$ have $\top$ as head.

Let $bvars(C)$ denote the ordered set of body variables of $C$. For a clause $C$
with

$bvars(C) = \{Y_1, \ldots, Y_m\}$ (2)

and a ground $p$-atom $e$, let

$val(C, e) := \{ (Y_1\sigma, \ldots, Y_m\sigma) \mid C\sigma \subseteq B \cup \{e\} \}$ (3)

denote the different value combinations assumed by the body variables of $C$
when matching the clause head against the example and the clause body against
the background knowledge. Note that for determinate clauses [12], $val(C, e)$
contains exactly one tuple.

Footnote: Depending on the transformation and the propositional learner that are
used, in certain cases it is even possible to transform the propositional learning
results back into an equivalent clausal theory [7,8].

^ To simplify our notation, we are treating B as constant and do not mention it 
explicitly in our definitions. 
We can now define a propositionalisation function $\varphi$ as follows:

$\varphi : C, e, T \mapsto (v_1, \ldots, v_{n_{\varphi,C}})$ , (4)

where $C \in \mathcal{C}$, $e$ is a ground $p$-atom, and $T$ is a set of tuples of
width $|bvars(C)|$. In other words, $\varphi$ produces the tuple of desired feature
values for an example $e$ with respect to the literals and variable bindings of the
clause $C$. Sometimes, it will be useful to also have a function which generates not
the individual feature values, but the list of names (and types) of the features that
are the result of propositionalising based on $C$:

$\Phi : C \mapsto Att_1, \ldots, Att_{n_{\varphi,C}}$ . (5)

Note that since in a propositional table, all examples must have the same
attributes, $\Phi$ and the width of $\varphi$ must not depend on $e$. Also note that
for both $\varphi$ and $\Phi$ we implicitly assume that the variables of each clause
are typed, so $\varphi$ and $\Phi$ can make use of this information when performing
the propositionalisation.

Here are two simple examples of using clauses in their entirety as features. 
The first is the transformation used in [6,9] on selected (parts of) clauses to
transform them into binary features. 

Example 1 (Existential Features). This transformation simply records whether
$C$ is present in $e$:

$\varphi_\exists(C, e, T) := (1)$ if $|T| > 0$, and $(0)$ otherwise. (6)

Example 2 (Counting Features). As a slight generalisation of the previous
example, this function counts how often $C$ can be matched against the example $e$
and background knowledge $B$:

$\varphi_\#(C, e, T) := (|T|)$ . (7)
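
Both example transformations depend only on the size of the value set $T$, which a short sketch makes concrete. The function names and the sample value set are invented for illustration; each function returns a one-element tuple so that results can be concatenated with other feature tuples.

```python
def phi_exists(T):
    """Existential feature: (1,) if the clause matches at least once, else (0,)."""
    return (1,) if len(T) > 0 else (0,)

def phi_count(T):
    """Counting feature: number of distinct matches of the clause."""
    return (len(T),)

# Hypothetical value set with two variable bindings
T = {(1, 1, 10, "a"), (2, 2, 20, "b")}
```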

In order to define the complete row of features corresponding to a particular
example, we simply concatenate the features generated with respect to each
clause in $\mathcal{C}$ with the values of the variables in $\top$. For a $p$-atom
$e = \top\sigma$, the propositionalisation with respect to $\mathcal{C}$ is defined
as follows:

$prop(\mathcal{C}, e) := (X_1\sigma, \ldots, X_a\sigma) \bigoplus_{C \in \mathcal{C}} \varphi(C, e, val(C, e))$ , (8)

where $\oplus$ denotes tuple concatenation.

Finally, the propositionalised table of examples is defined as the union of all
example propositionalisations, adding in the class attribute:

$\tau(\mathcal{C}, E^+, E^-) := \{ prop(\mathcal{C}, e) \oplus (1) \mid e \in E^+ \} \cup \{ prop(\mathcal{C}, e) \oplus (0) \mid \neg e \in E^- \}$ . (9)

® Note that this definition can easily be adapted to the case where one of the argu- 
ments of T is the attribute to be predicted. 
2.2 Local Functions and Redundancy

An important class of propositionalisation functions is the class of local proposi- 
tionalisation functions which compute propositional features taking into account 
only one of the body variables at a time. 

$\varphi$ is local iff there is a function $\varphi'$ such that

$\varphi(C, e, T) = \bigoplus_{i=1..width(T)} \varphi'(\pi_{(i)}(T))$ , (10)

where $\pi_{(i)}$ denotes projection on the $i$-th column.

This class of propositionalisation functions is important because it easily 
allows the removal of redundant features whenever there are functional depen- 
dencies between a single predicate (or set of predicates) and another predicate. 
If $D$ is a set of atoms, $L$ an atom, then $D \triangleright L$ is a functional
dependency iff for any $\sigma$ such that

$D\sigma \subseteq E \cup B$ , (11)

there is exactly one $\theta$ such that

$L\sigma\theta \in E \cup B$ . (12)

Note that functional dependencies are closely related to the idea of determinate
literals [12], except that for determinate literals, one often allows at most
one substitution given the preceding literals, whereas a functional dependency
requires that there be exactly one such substitution.
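
For the common special case of a dependency between one argument of one relation and one argument of another, the "exactly one" condition can be checked directly on the ground atoms. The sketch below is an illustration only: the representation of atoms as `(predicate, arguments)` pairs, the 0-based argument positions, the function name and the sample facts are all assumptions.

```python
def is_functional_dependency(facts, source, target):
    """Check that every value occurring at argument position source[1] of
    predicate source[0] has exactly one matching fact of predicate target[0]
    with the same value at position target[1] (positions are 0-based)."""
    (spred, spos), (tpred, tpos) = source, target
    keys = {args[spos] for pred, args in facts if pred == spred}
    return all(
        sum(1 for pred, args in facts if pred == tpred and args[tpos] == k) == 1
        for k in keys
    )

# Hypothetical ground atoms
facts = [
    ("disposition", (1, 1, 1, 10, "a")), ("disposition", (2, 1, 2, 20, "b")),
    ("client", (1, 1000)), ("client", (2, 2000)),
    ("card", (1, 1, 100)), ("card", (2, 2, 200)), ("card", (3, 2, 300)),
]

# Third argument of disposition determines exactly one client fact.
fd_client = is_functional_dependency(facts, ("disposition", 2), ("client", 0))
# First argument of disposition does not determine a unique card fact.
fd_card = is_functional_dependency(facts, ("disposition", 0), ("card", 1))
```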

For local propositionalisation functions, we can drop all the features gener- 
ated based on one clause if there is another clause which differs from it only in 
that it contains an additional functionally dependent literal. The reason for this 
is expressed in the following lemma. 

Lemma 1. Let $C$ and $C'$ be two clauses from $\mathcal{C}$ such that

$C' = C \cup \{L\}$ . (13)

If there is a functional dependency $D \triangleright L$ such that

$D \succeq C$ (14)

($D$ $\theta$-subsumes $C$), then for any local $\varphi$, and any $p$-atom $e$,

$\varphi(C', e, val(C', e)) = \varphi(C, e, val(C, e)) \bigoplus_{z \in V_L} \varphi'(\pi_{(z)}(val(C', e)))$ , (15)

where we assume that $V_L$ are the variables of $L$ not occurring in $C$.

Proof. Clearly, due to the functional dependency, for any variable binding tuple
in $val(C, e)$ there will be exactly one completion resulting in a matching tuple in
$val(C', e)$. This means that $val(C, e)$ and $val(C', e)$ are different, but since the
transformation function is local, the extra columns in $val(C', e)$ do not influence
the computation of the feature values on variables contained in both $C$ and $C'$,
so the feature values computed for these variables with respect to $C$ and $C'$ will
be identical.

This means that it suffices to consider $C'$ when constructing $prop(\mathcal{C}, e)$,
since the features constructed based on $C$ will be redundant.

Note that this lemma can be generalised to cases where there is more than 
one additional functionally dependent literal, and to cases where ip also produces 
existential or counting features. 

In our approach to be described in the next section, we assume that the 
functional dependencies to be used for redundancy removal are explicitly given 
by the user. However, it will of course also be possible to use one of the existing 
algorithms for functional dependency discovery to automate this step. 

3 Propositionalisation by Automatic Aggregation 

As pointed out in the introduction, the primary challenge in propositionalising 
ILP data is due to the non-determinacy of most applications. In the terminology 
introduced in the preceding section, this means that val{C, e) can become quite a 
large set. This is especially true in business applications, where it is quite possible 
for example that a company maintains hundreds of transactions on record for 
a single customer. Previous approaches to propositionalisation in ILP that were 
restricted to determinate clauses thus cannot adequately handle such datasets. 

Transformation function. In order to design our approach to transformation- 
based ILP learning, we have therefore borrowed the idea of aggregation that 
is commonplace in the database area and often used in preprocessing for 
propositional learners. Aggregation is an operation that replaces a set of values 
by a suitable single value that summarises properties of the set. For numerical 
values, simple statistical descriptors such as average, maximum, and minimum
can be used; for nominal values, we can use the mode (the most frequent value)
or count the number of occurrences of the different possible values.

More precisely, in the framework of the preceding section, we define a local
propositionalisation function $\varphi'$ as follows. Let $C$ be a clause with
$bvars(C) = \{Y_1, \ldots, Y_m\}$. For a numeric variable $Y_i \in bvars(C)$, let
$T_i := \pi_{(i)}(val(C, e))$. Then define

$\varphi'(T_i) := (avg(T_i), max(T_i), min(T_i), sum(T_i))$ , (16)

where $avg(T_i)$, $max(T_i)$, $min(T_i)$, and $sum(T_i)$ compute the average, maximum,
minimum, and sum of the values in $T_i$, respectively. For a nominal variable
$Y_i \in bvars(C)$, let $T_i$ be as above. Then define

$\varphi'(T_i) := \bigoplus_{v \in domain(Y_i)} (count(v, T_i))$ , (17)

where $domain(Y_i)$ is the ordered set of possible values for $Y_i$, and $count(v, T_i)$
is a function that provides the number of occurrences of value $v$ in $T_i$ (again,
$\oplus$ denotes tuple concatenation). In addition, we use the total size of the set
$T := val(C, e)$ as a feature, resulting in the transformation function

$\varphi(C, e, T) := (|T|) \bigoplus_{i=1..m} \varphi'(T_i)$ . (18)
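
A minimal sketch of this aggregation step, under stated assumptions: the value set is given as a list of rows, column types and nominal domains are supplied explicitly, and columns without a declared type (for example key attributes) are skipped. The function name, the type labels and the sample rows are illustrative and not part of the described system; the sketch also assumes a non-empty value set.

```python
from statistics import mean

def aggregate(rows, columns, types, domains):
    """Return (name, value) pairs: the size of the value set followed by the
    local aggregates of each typed column. Assumes rows is non-empty."""
    features = [("count", len(rows))]
    for i, name in enumerate(columns):
        values = [row[i] for row in rows]
        kind = types.get(name)          # untyped columns (e.g. keys) are skipped
        if kind == "numeric":
            features += [(f"avg({name})", mean(values)),
                         (f"max({name})", max(values)),
                         (f"min({name})", min(values)),
                         (f"sum({name})", sum(values))]
        elif kind == "nominal":
            features += [(f"count({name}={v})", values.count(v))
                         for v in domains[name]]
    return features

# Hypothetical value set with one numeric and one nominal non-key column
columns = ["A", "B", "C", "D", "X1", "X2"]
rows = [(1, 1, 1, 1, 10, "a"), (1, 1, 2, 2, 20, "b")]
features = aggregate(rows, columns,
                     types={"X1": "numeric", "X2": "nominal"},
                     domains={"X2": ["a", "b"]})
```

The returned names play the role of the attribute names and the values form the feature tuple for one example.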
A function $\Phi$ was chosen to produce attribute names for the tuples resulting
from propositionalisation. This function ensures unique attribute names by
including the following information about the items used in the computation of
the attribute values: a short name of the aggregate function applied, the name
of the predicate from $E$ or $B$ concerned, the position/name of the argument, if
applicable, and an identification of $C \in \mathcal{C}$.

Clause set. In order to decide which clause set C to use as the basis of our 
transformation, consider again the nature of business databases. Typically, they 
will exploit foreign key relationships to structure their data. We have therefore 
chosen to generate the set C on the basis of the foreign link bias language which 
was first introduced in Midos [TTIT^ and allows to easily model the structure 
of such databases. This bias is an ordered set of links $\mathcal{L}$, where each
$l \in \mathcal{L}$ provides information about the argument positions of literals of
two predicates where variables may be shared. As an additional level of control, our declarative
bias language allows the specification of an upper limit on the number of literals 
with which a given literal may share variables. This limit effectively controls the 
branching factor in the tree of literals generated by the foreign links. 

Redundancy removal. In order to exploit the potential offered by Lemma 1 for
removing redundant features, we also allow the user to specify a set of functional
dependencies $\mathcal{F}$.

The components discussed above result in an algorithm which is given in
Table 1. Step 2 of the algorithm implements the clause construction process based
on foreign links, removing redundant clauses (and thus the redundant features
they would otherwise give rise to) in step 2b. Steps 3 to 5 implement the actual
construction of the propositional table based on the transformation function
$\varphi$ defined above. Step 6 finally is a normalisation step which maps the value
range of each numeric attribute to the interval [-1, 1]; this is used for propositional
learners which benefit from normalised value ranges.
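
The normalisation step can be sketched as a column-wise linear rescaling. The text does not give the exact mapping, so the min-max formula below, the treatment of constant columns (mapped to 0) and the function name are assumptions.

```python
def normalise(table):
    """Linearly map each numeric column of a row-major table to [-1, 1]."""
    columns = list(zip(*table))
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        if hi == lo:
            scaled.append([0.0] * len(col))      # assumed choice for constant columns
        else:
            scaled.append([2 * (v - lo) / (hi - lo) - 1 for v in col])
    return [list(row) for row in zip(*scaled)]

# Hypothetical table with one varying and one constant column
table_norm = normalise([[0, 5], [10, 5], [5, 5]])
```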

Example 3. The following examples are based on the financial dataset of the
PKDD-1999/PKDD-2000 challenge that is utilised in the first half of the
experiments. For illustrative purposes, the number of relations and examples is
reduced here, and all but the key attributes (with primary keys always in the
first argument positions) are invented.

Let the sets of positive and negative examples be
$E^+ = \{e\}$ with e = loan(1,1), $E^- = \emptyset$.

Let the set of background knowledge atoms be
B = { account(1),
      disposition(1,1,1,10,a), disposition(2,1,2,20,b),
      client(1,1000), client(2,2000),
      card(1,1,100), card(2,2,200), card(3,2,300) } .

Let the ordered set of foreign links (obeying the pattern
link(<rel1>:<pos1>, <rel2>:<pos2>), with "rel" for relation and "pos" for
argument position) be
Table 1. RELAGGS algorithm

1. Accept as input: $E$, $B$ ($n$ predicates), $\mathcal{L}$, $\mathcal{F}$
2. Construct $\mathcal{C}$:
   (a) Generate all clauses $C$ as ordered sets of literals $L$ subject to the
       following restrictions:
       i.   $|C| \le n + 1$
       ii.  For $L, L' \in C$: $predicate(L) \neq predicate(L')$
       iii. First $L \in C$: $predicate(L)$ = target predicate
       iv.  $L \in C$: most general, i.e. variables in argument positions
       v.   $L \in C$, $L$ non-first: $\exists L'$ before $L$ in $C$,
            $l \in \mathcal{L}$ such that $L$ shares a variable with $L'$
   (b) Eliminate $C$ if there is $C'$ such that $C = C_1 L_1$, $C' = C L_2 C_2$,
       with $f \in \mathcal{F}$ specifying a functional dependency between $L_1$ and $L_2$
3. Generate new line for TABLE
4. For all $C \in \mathcal{C}$
   (a) Determine $\Phi(C)$
   (b) For all $Att_i \in \Phi(C)$, append $Att_i$ to TABLE
5. For all $e \in E$
   (a) Generate new line for TABLE
   (b) For all $C \in \mathcal{C}$
       i.   Determine $T = val(C, e)$
       ii.  Determine $\varphi(C, e, T)$
       iii. For all $v \in \varphi(C, e, T)$ append $v$ to TABLE
   (c) Append class value of $e$ to TABLE
6. Normalise feature values of TABLE to [-1, 1] and append those to TABLE_norm
7. Output TABLE and TABLE_norm




$\mathcal{L}$ = { link(loan:2, account:1),
      link(account:1, disposition:2),
      link(disposition:3, client:1),
      link(disposition:1, card:2) } .

Now consider

C1 = loan(A,B) :- account(B), disposition(C,B,D,X1,X2).

In a first step, $val(C_1, e)$ is determined, which is depicted in Table 2. Here,
each line corresponds to a tuple of values of $val(C_1, e)$. In a second step,
$\varphi$ and $\tau$ are applied and result in Table 3, which shows the
propositionalised table of $E$ and $B$ with $\mathcal{C} = \{C_1\}$.
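
The first step can be reproduced with a small matcher over ground atoms. This is a sketch under assumptions rather than the RELAGGS implementation: atoms are `(predicate, arguments)` pairs, variables are strings starting with an upper-case letter, all variables of the clause (head and body) are reported in order of first appearance, and the names `is_var`, `match` and `val` are chosen for the example.

```python
def is_var(term):
    """Variables are strings starting with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()

def match(pattern, fact, subst):
    """Extend subst so that pattern becomes equal to fact; None if impossible."""
    (pred, args), (fpred, fargs) = pattern, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    s = dict(subst)
    for a, v in zip(args, fargs):
        if is_var(a):
            if s.setdefault(a, v) != v:   # conflicting binding
                return None
        elif a != v:                      # constant mismatch
            return None
    return s

def val(head, body, example, background):
    """All value combinations of the clause variables when the head is matched
    against the example and the body against the background knowledge."""
    variables = list(dict.fromkeys(
        t for _, args in [head] + body for t in args if is_var(t)))
    rows = set()

    def extend(i, s):
        if i == len(body):
            rows.add(tuple(s[v] for v in variables))
            return
        for fact in background:
            s2 = match(body[i], fact, s)
            if s2 is not None:
                extend(i + 1, s2)

    start = match(head, example, {})
    if start is not None:
        extend(0, start)
    return rows

B = [("account", (1,)),
     ("disposition", (1, 1, 1, 10, "a")), ("disposition", (2, 1, 2, 20, "b")),
     ("client", (1, 1000)), ("client", (2, 2000)),
     ("card", (1, 1, 100)), ("card", (2, 2, 200)), ("card", (3, 2, 300))]
e = ("loan", (1, 1))
head = ("loan", ("A", "B"))
body = [("account", ("B",)), ("disposition", ("C", "B", "D", "X1", "X2"))]
rows = val(head, body, e, B)
```

Each returned tuple lists the bindings of A, B, C, D, X1 and X2, in that order.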



Table 2. The value set $val(C_1, e)$

Aσ   Bσ   Cσ   Dσ   X1σ   X2σ
1    1    1    1    10    a
1    1    2    2    20    b
Table 3. The propositionalised table based on $C_1$ ($\tau(\{C_1\}, E^+, E^-)$)

count  avg(X1)  max(X1)  min(X1)  sum(X1)  count(X2=a)  count(X2=b)  class
2      15       20       10       30       1            1            1



Let C1 be as above, and let C2 =

loan(A,B) :- account(B), disposition(C,B,D,X1,X2), client(D,Y).

Let us assume that the set of functional dependencies $\mathcal{F}$ contains a
description of such a dependency between disposition and client, i.e.

{disposition(_,_,D,_,_)} $\triangleright$ client(D,_).

Then, $val(C_2, e)$ produces tuples as depicted in Table 4 on the left. The result of
$val(C_2, e)$ differs from $val(C_1, e)$ only in the additional column for Y. In
particular, the columns for X1 and X2 are the same in both tables, so that any local
aggregate function applied here would not yield different results for $val(C_1, e)$
and $val(C_2, e)$. Hence, we can decide not to consider $C_1$.



Table 4. The value sets $val(C_2, e)$ (left) and $val(C_3, e)$ (right)

$val(C_2, e)$:

Aσ   Bσ   Cσ   Dσ   X1σ   X2σ   Yσ
1    1    1    1    10    a     1,000
1    1    2    2    20    b     2,000

$val(C_3, e)$:

Aσ   Bσ   Cσ   Dσ   X1σ   X2σ   Eσ   Zσ
1    1    1    1    10    a     1    100
1    1    2    2    20    b     2    200
1    1    2    2    20    b     3    300




Let C2 be as above, and let C3 =

loan(A,B) :- account(B), disposition(C,B,D,X1,X2), card(E,C,Z).

For this clause, the functional dependency given above does not apply. Table 4
shows $val(C_3, e)$ on the right. Here, there are differences with respect to the
columns for X1 and X2 of $val(C_2, e)$ and $val(C_3, e)$, so the aggregates can
differ as well. For example, the average of X1 for $val(C_2, e)$ is 15,
while it is approximately 16.7 for $val(C_3, e)$. This can be viewed as weighting
the property X1 of a disposition in the light of the number of credit cards issued
for this disposition. This illustrates why our algorithm will consider both C2 and
C3 for the computation of the final propositionalised table.
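
The effect on the aggregates can be checked with a few lines. The rows below are transcribed from the value sets of the example, with the third row of $val(C_3, e)$ taking its card values from card(3,2,300) in $B$; the helper name `column` and the variable names are illustrative.

```python
from statistics import mean

def column(rows, i):
    """Project a list of rows onto its i-th column (duplicates are kept)."""
    return [row[i] for row in rows]

X1 = 4  # position of X1 in every value set below

val_c1 = [(1, 1, 1, 1, 10, "a"), (1, 1, 2, 2, 20, "b")]
val_c2 = [(1, 1, 1, 1, 10, "a", 1000), (1, 1, 2, 2, 20, "b", 2000)]
val_c3 = [(1, 1, 1, 1, 10, "a", 1, 100),
          (1, 1, 2, 2, 20, "b", 2, 200),
          (1, 1, 2, 2, 20, "b", 3, 300)]

avg_c1 = mean(column(val_c1, X1))
avg_c2 = mean(column(val_c2, X1))
avg_c3 = mean(column(val_c3, X1))
```

The functionally dependent literal in C2 leaves the X1 column unchanged, whereas the non-determinate card literal in C3 duplicates the second disposition and shifts the average.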



4 Experimental Evaluation 

The goal of our experiments was to see whether indeed the business databases 
that we are focusing on possess properties that distinguish them from other ILP 
learning problems in such a way that they lead to markedly different behaviour 
of state-of-the-art ILP learners (which are not optimised for such problems) 
compared to our own approach which we will call RELAGGS. As the basis of 
our experiments, we have therefore used eight learning problems originating from 
business domains, in particular from the discovery challenges that took place in
conjunction with PKDD-1999/PKDD-2000 and ECML-1998, respectively. 

The first discovery challenge problem from PKDD that we used involves
learning to classify bank loans into good and bad loans [1]. The data set comprises
8 relations and contains, for each loan, customer information such as personal
data, credit card ownership data, and socio-demographic data; moreover, account
information is provided, including permanent orders and up to several hundred
transactions per account. This problem is thus typical of the non-determinacy
that was discussed in the introduction.

The application gives rise to four different learning problems. The first two 
problems feature 234 examples which consist of all loans that are finished, i.e., 
either paid back in full or defaulted. Since certain learning systems exhibited 
marked differences in learning result depending on whether the good or the bad 
loans were used as target concept, we chose to split this task into two problems: 
learning using the good loans as target concept (Loan234A), and learning using 
the bad loans as target concept (Loan234B). Similarly, there are two further 
problems based on a data set of 682 examples which also contain loans that are 
still “in progress”, but for which a classification of current loan rating (good or
bad) is available (Loan682AC uses the good loans, Loan682BD the bad loans as 
target concept). 

The other learning problems are taken from the customer data warehouse
of a large Swiss insurance company (cf. [3]), where 10 tables were extracted
for the ECML challenge mentioned above. Again, information is included about
the customers (called partners), plus information about their contracts, their
roles in the contracts, their households etc. In this application, two unspecified
learning tasks were provided with challenge data; they involve learning (different)
classifications of 17,267 customers (Part1 and Part2 in the table) and 12,934
households (Hhold1 and Hhold2).

In order to evaluate the hypotheses, we chose to compare our approach
RELAGGS to two other systems. First, we compared with DINUS-C, a learner based
on transformations in the style of DINUS [8], i.e. using only the features that
are determinate, and using C4.5rules [13] as propositional learner. Second, we
compared to PROGOL, a very powerful state-of-the-art ILP learner capable
of learning in structurally very complex domains. For PROGOL, we used standard
parameter settings. We also experimented with other parameter settings,
but this did not yield better results.

The aggregates produced by RELAGGS comprise 938 attributes for the
Loan tasks, while those for the Part and Hhold tasks consist of 1,688 and
2,232 columns, respectively. In order to examine whether the success of our
transformation-based approach depends on the propositional learner that is used,
we used two variants of RELAGGS: RELAGGS-C uses C4.5rules [13], whereas
RELAGGS-S uses SVMlight, a fast support vector machine implementation.

For C4.5rules, we used standard parameter settings, while for SVMlight we used
normalised data and parameter settings as applied in experiments reported in
m-
All experimental evaluations were carried out as ten-fold cross-validations.
Table 5 shows, for each of the participating learners, the average error across the
ten folds and the standard deviation; the best result on each problem is marked
in bold. In addition, we provide the order of magnitude of the time taken for
each learning run.^



Table 5. Error rate averages and standard deviations from 10-fold cross-validation,
and runtime order of magnitude for single training runs (C - C4.5rules, S - SVMlight)

Task        Measurand    DINUS-C        RELAGGS-C     RELAGGS-S     PROGOL
Loan234A    Error rate   14.9 ± 10.3    12.0 ± 6.5    12.0 ± 5.3    54.3 ± 10.5
            Runtime      sec            min           min           h
Loan234B    Error rate   14.9 ± 10.3    12.0 ± 6.5    12.0 ± 5.3    13.3 ± 7.1
            Runtime      sec            min           min           min
Loan682AC   Error rate   11.1 ± 3.6     5.9 ± 3.2     9.2 ± 3.2     n.a.
            Runtime      sec            min           min           d
Loan682BD   Error rate   11.1 ± 3.6     5.9 ± 3.2     9.2 ± 3.2     11.3 ± 3.6
            Runtime      sec            min           min           h
Part1       Error rate   18.1 ± 0.6     8.3 ± 0.7     9.9 ± 0.4     n.a.
            Runtime      min            h             h             d
Part2       Error rate   18.1 ± 0.6     8.3 ± 0.7     9.9 ± 0.4     n.a.
            Runtime      min            h             h             d
Hhold1      Error rate   39.6 ± 1.6     6.0 ± 0.9     14.3 ± 1.5    n.a.
            Runtime      min            h             h             d
Hhold2      Error rate   39.6 ± 1.6     6.0 ± 0.9     14.3 ± 1.5    n.a.
            Runtime      min            h             h             d




As can be seen from the table, the business databases used here indeed seem 
to possess properties that make it appear useful to use an aggregating learner 
like RELAGGS. On all eight problems, it is one of the RELAGGS variants which 
shows the lowest error. In addition, it is remarkable that RELAGGS strikes a 
reasonable balance in runtime between the use of determinate features only and 
the use of a non-transformation based learner like PROGOL. Interestingly, the 
complex search performed by PROGOL does not lead to good accuracies here, 
indicating that such business domains present quite different challenges from the 
problems on which PROGOL excels. 
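
The claim that a RELAGGS variant has the lowest error on every problem can be checked mechanically against the mean error rates of Table 5. The Python sketch below is only an illustration: the dictionary layout and the helper name best_learners are assumptions, and the standard deviations are ignored.

```python
# Mean error rates from Table 5; None marks PROGOL entries reported as n.a.
LEARNERS = ("DINUS-C", "RELAGGS-C", "RELAGGS-S", "PROGOL")
ERRORS = {
    "Loan234A": (14.9, 12.0, 12.0, 54.3),
    "Loan234B": (14.9, 12.0, 12.0, 13.3),
    "Loan682AC": (11.1, 5.9, 9.2, None),
    "Loan682BD": (11.1, 5.9, 9.2, 11.3),
    "Part1": (18.1, 8.3, 9.9, None),
    "Part2": (18.1, 8.3, 9.9, None),
    "Hhold1": (39.6, 6.0, 14.3, None),
    "Hhold2": (39.6, 6.0, 14.3, None),
}


def best_learners(row):
    """Names of the learners sharing the lowest available mean error."""
    available = [(err, name) for err, name in zip(row, LEARNERS) if err is not None]
    lowest = min(err for err, _ in available)
    return sorted(name for err, name in available if err == lowest)


winners = {task: best_learners(row) for task, row in ERRORS.items()}
all_relaggs = all(
    name.startswith("RELAGGS") for names in winners.values() for name in names
)
print(winners, all_relaggs)
```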

The differences between the learners are in fact statistically significant.
Table 6 shows the win-loss-tie statistics, where a comparison was counted as a win
if the difference was significant according to a paired t-test at level α = 0.05. As
can be seen there, the RELAGGS variant using C4.5rules significantly outperforms
RELAGGS with SVM as well as the other two learners.
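
How a single win, loss or tie could be decided from per-fold error rates is sketched below in Python. The per-fold errors are not reported in the paper, so the three lists are invented purely for illustration; the function names are likewise assumptions. The critical value 2.262 is the standard two-sided Student t quantile for 9 degrees of freedom at α = 0.05, which applies to ten paired folds.

```python
import math
import statistics

T_CRIT_DF9 = 2.262  # two-sided critical t value, 9 degrees of freedom, alpha = 0.05


def paired_t(errors_a, errors_b):
    """Paired t statistic for per-fold error rates of learners A and B."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        # All fold differences are identical: no variation to test against.
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))


def outcome(errors_a, errors_b):
    """'win' if A's error is significantly lower than B's, 'loss' if higher, else 'tie'."""
    t = paired_t(errors_a, errors_b)
    if t < -T_CRIT_DF9:
        return "win"
    if t > T_CRIT_DF9:
        return "loss"
    return "tie"


# Invented per-fold error rates (in percent) for three hypothetical learners.
fold_a = [5, 6, 4, 7, 5, 6, 5, 4, 6, 5]
fold_b = [11, 12, 10, 12, 11, 13, 10, 11, 12, 11]
fold_c = [5, 7, 4, 6, 6, 5, 6, 4, 5, 6]

print(outcome(fold_a, fold_b), outcome(fold_b, fold_a), outcome(fold_a, fold_c))
```

Counting these outcomes over all problems on which both learners finished gives one win-loss-tie entry of the kind shown in Table 6.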



^ Five PROGOL runs were aborted after running more than two days; the correspond- 
ing fields are marked n.a. 
Table 6. Win-loss-tie for rows vs. columns: differences statistically significant according
to paired t-test at level α = 0.05 (C - C4.5rules, S - SVMlight)

             DINUS-C   PROGOL   RELAGGS-S
RELAGGS-C    6-0-2     2-0-1    6-0-2
RELAGGS-S    4-0-4     1-0-2




5 Related Work 

Our approach is based on the general idea of transforming multirelational
representations of data into representations amenable to efficient propositional learners,
as instantiated by LINUS [7], DINUS [8], and other systems |9I6J .

LINUS [7] and DINUS [8] both use a restricted class of expressions (constrained
and determinate literals), whereas our approach handles arbitrary general
classes. In terms of the general transformation framework introduced in this
paper, systems such as LINUS and DINUS can be seen as using a suitably restricted
set of classes C* in order to ensure that {val(C, e) | (C ∈ C, e ∈ E)} is a
singleton set, thus allowing the values of this single tuple to be used as features
without further transformations.

Our approach is also related to |S] and |S] where general classes are allowed, 
but where the subsequent transformation mainly consists of checking whether 
such a class has instances or not (the function defined above). In contrast, 
our approach uses a more general class of transformation functions including 
aggregation and counting, thus using the information in val(C,e) in a more 
sophisticated way. 

Finally, our approach of generating the clause set C is closely related to the 
declarative bias language of Midos M- Here, we supplement the set of foreign 
links C by information about functional dependencies J- between predicates of 
examples E and background knowledge B. This supplement ensures efficiency 
by avoiding redundant features during the propositionalisation process for larger 
multirelational databases. 

Note that our approach is open towards using different ways of generating 
C, e.g. by stochastic search as suggested in [6], or based on different biases as in 

0 . 

6 Conclusion 

In this paper, we have presented RELAGGS, a transformation based ILP learner 
specifically geared to the challenges presented by business databases, which are 
often structurally quite simple, but large in size with massive amounts of non- 
determinacy. We have presented a general framework for such transformation- 
based learners which encompasses also the simpler transformations of [SEE], 
and have described how aggregation operations can naturally be used to define 
the transformation functions for our learner. In addition, we have shown how 
functional dependencies can be used to eliminate redundant features as long as
the transformation function is local. 

Our experimental evaluation on eight learning problems from two business
domains shows that indeed such applications have special properties that make
them difficult both for determinate literal based transformation learners and
for non-transformation based learners that are geared more towards structurally
complex domains. While RELAGGS beats these learners in a statistically
significant way, our experiments did not show that much of a difference between
variants of RELAGGS using C4.5rules and SVMs as propositional learners, with
slight advantages for RELAGGS with C4.5rules.

In future work, we will evaluate RELAGGS on further domains, in particular
problems with an even larger number of tuples and relations. It will be interesting
to see whether the parity between C4.5rules and SVMs holds up when the number of
relations, and thus the number of features, is further increased. We will then
also investigate whether some of the results of selecting or filtering features
heuristically during propositionalisation could be of use (cf. [6,10]). Finally, we
will work on the transformation function of RELAGGS, incorporating further
aggregate descriptors.
Abstract. In recent times, there has been a growing interest in both the extension of
data mining methods and techniques to spatial databases and the application of 
inductive logic programming (ILP) to knowledge discovery in databases 
(KDD). In this paper, an ILP application to association rule mining in spatial 
databases is presented. The discovery method has been implemented into the 
ILP system SPADA, which benefits from the available prior knowledge on the 
spatial domain, systematically explores the hierarchical structure of task- 
relevant geographic layers and deals with numerical aspatial properties of spa- 
tial objects. It operates on a deductive relational database set up by selecting 
and transforming data stored in the underlying spatial database. Preliminary 
experimental results have been obtained by running SPADA on geo-referenced 
census data of Manchester Stockport, UK. 



1 Introduction 

One of the great challenges for the near future is knowledge discovery in ever grow- 
ing spatial sets [4]. Nevertheless, most work in the KDD community up to now has 
been almost exclusively focused on pattern discovery in relational and transaction 
databases. Only in recent times, data mining methods and techniques have been pro- 
posed for the extraction of implicit knowledge, spatial relations, or other patterns not 
explicitly stored in spatial databases [8]. A peculiarity of the spatial domain is that the
attributes of the neighbors of some spatial object of interest may have an influence on 
it and therefore have to be considered as well [6]. Thus, spatial data mining algo- 
rithms cannot neglect the implicit relations of spatial neighborhood (e.g. topological 
relations) that are defined by the explicit location and extension of spatial objects. 

As the interest in KDD is generally increasing, many recent applications of ILP 
methods and techniques to KDD have also emerged [3]. We claim that spatial data 
mining is a promising ILP application domain for two main reasons. First, ILP relies 
on the theory of computational logic which supplies representation and reasoning 
means appropriate for the spatial domain where relations among objects play a key 
role and are often inferred by qualitative reasoning. Second, ILP offers an elegant 
solution to multi-relational mining whereas traditional approaches to spatial data 
mining usually solve the problem by collapsing multiple relations into the universal 
relation [9]. To the best of our knowledge, very few contributions from ILP to
knowledge discovery in spatial databases have been reported in the literature. GwiM [14] is
a general-purpose ILP system that can solve several spatial data mining tasks, though
no insight into the algorithmic issues has been provided. INGENS [11] is an inductive
GIS with learning capabilities that currently support the classification task.

In this paper, we focus our attention on the task of mining spatial association rules, 
namely the detection of associations between spatial objects, and propose to accom- 
plish the task by means of a novel special-purpose ILP system, called SPADA (Spa- 
tial PAttern Discovery Algorithm) [12]. It benefits from the available prior knowl- 
edge on the spatial domain, systematically explores the hierarchical structure of task- 
relevant geographic layers and deals with numerical aspatial properties of spatial 
objects. Furthermore, it operates on a deductive relational database (DDB) set up by
selecting and transforming data stored in the underlying spatial database. The analysis
of geo-referenced census data has been chosen as an application domain. Indeed, the
advances in the practice of geo-referencing socioeconomic phenomena allow census 
data to be conceptualized as spatial objects with numerical aspatial properties. 

The paper is organized as follows. Section 2 introduces the spatial data mining 
problem solved by SPADA. Experimental results on geo-referenced census data of 
Stockport, one of the ten Metropolitan Districts of Greater Manchester, UK, are re- 
ported in Section 3. Conclusions are given in Section 4. 



2 Mining Spatial Association Rules with SPADA 

The discovery of spatial association rules is a descriptive mining task aiming at the 
detection of associations between reference objects and task-relevant objects, the 
former being the main subject of the description while the latter being spatial objects 
that are relevant for the task at hand and spatially related to the former. For instance, 
we may be interested in describing a given area by finding associations among large 
towns (reference objects) and spatial objects in the road network, hydrography and 
administration layers (task-relevant objects). Some kind of taxonomic knowledge on 
task-relevant geographic layers may also be taken into account to get descriptions at 
different concept levels (multiple-level association rules). As usual in association rule 
mining, we search for associations with large support and high confidence (strong 
rules). Formally, SPADA can solve the following spatial data mining problem:

Given

- a spatial database (SDB),

- a set of reference objects S,

- some task-relevant geographic layers R_k, 1≤k≤m, together with spatial hierarchies
defined on them,

- two thresholds for each level l in the spatial hierarchies, minsup[l] and minconf[l],

Find strong multiple-level spatial association rules.

To solve the problem, Koperski and Han propose a top-down, progressive refinement
method which exploits taxonomies both on topological relations and spatial
objects [9]. The method has been implemented in the module Geo-Associator of the
spatial data mining system GeoMiner [7]. We propose an upgrade of Geo-Associator
to first-order logic representation of data and patterns. The approach is inspired by the
work on multi-relational data mining reported in [2] and operates on a DDB set up by
a preliminary feature extraction step from SDB and denoted D(S). In particular, we
resort to Datalog [1], whose expressive power allows us to specify also prior knowledge
(BK) such as spatial hierarchies, spatial constraints and rules for spatial qualitative
reasoning. Given a set of Datalog atoms A, a spatial association rule in D(S) is an
implication of the form P → Q (s%, c%), where P ⊆ A, Q ⊆ A, P ∩ Q = ∅, and at least one
atom in P ∪ Q represents a spatial relationship. The percentages s and c are called the
support and the confidence of the rule respectively. An example of spatial association
rule in our framework is:

is_a(A, large_town), intersects(A,B), intersects(A,C), is_a(C, regional_road), intersects(D,C),
D\=A, C\=B → is_a(B, main_trunk_road), is_a(D, large_town) (54%, 86%)

“GIVEN THAT 54% of large towns intersect both a main trunk road and a regional 
road the latter intersecting a large town distinct from the previous one, IF a large 
town A intersects two spatial objects the former being an unknown B while the latter 
being a regional road which in turn intersects some spatial object D distinct from A 
THEN WITH CONFIDENCE 86% B is a main trunk road and D is a large town”.

The choice of an ILP algorithm to accomplish the mining task at hand heavily
affects the whole KDD process. Indeed, D(S) is obtained by selecting and transforming
the portion of SDB that concerns the set of reference objects S and adding it to the
BK. Data selection encompasses the retrieval of spatial objects, possibly together
with their spatial and aspatial properties, and the extraction of spatial relationships
between reference objects and task-relevant objects. In particular, SPADA can extract
topological relations whose semantics has been defined according to the
9-intersection model [5]. It is noteworthy that finding the right compromise between
on-line computation (time-consuming solution) and materialization (space-consuming
solution) of spatial relations is a hot topic in spatial data mining. More sophisticated
computational solutions are reported in [6, 9]. Once selected, the data need to be
transformed into a suitable format. For instance, numerical properties of spatial objects
with a large domain must be discretized in order to be handled by logic-based data
mining methods. SPADA currently implements an adaptation of the relative
unsupervised discretization algorithm RUDE [10] to the first-order case.

The spatial data mining step requires the solution of two sub-tasks: 1) find large
(or frequent) spatial patterns; 2) generate strong spatial association rules. The reason
for this decomposition is that frequent patterns are commonly not considered useful
for presentation to the user as such. They can be efficiently post-processed into
association rules that exceed given threshold values of support and confidence. It is
noteworthy that SPADA, like Geo-Associator but unlike WARMR
[2], exploits is-a taxonomies for extracting multiple-level patterns and association
rules. Thus, largeness and strength depend on the level currently explored in the
hierarchical structure of task-relevant geographic layers. To be more precise, a pattern
P is large (or frequent) at level l if σ(P) ≥ minsup[l] and all ancestors of P with respect
to the spatial hierarchies are large at their corresponding levels. A spatial association
rule P → Q is strong at level l if the pattern P ∪ Q is large and the confidence is high at
level l, namely φ(Q|P) ≥ minconf[l]. In SPADA, the counting procedures for support
and confidence are based on the coverage test of spatial observations, which is the ILP
counterpart of counting the number of reference objects that satisfy a certain spatial
pattern. Indeed, the spatial observations are portions of D(S), each of which concerns
one and only one reference object. Thus, the two percentages associated with P → Q
mean that s% of spatial observations in D(S) are covered by P ∪ Q and c% of spatial
observations in D(S) that are covered by P are also covered by P ∪ Q, respectively.
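
The counting described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not SPADA's implementation: each spatial observation is reduced to two invented boolean flags saying whether it is covered by P and by Q, so that the coverage test itself (which in SPADA is a logical test against D(S)) is replaced by a lookup. All names are hypothetical.

```python
# Ten toy spatial observations, one per reference object (flags are invented).
OBSERVATIONS = (
    [{"p": True, "q": True}] * 6       # covered by P and by Q
    + [{"p": True, "q": False}] * 1    # covered by P only
    + [{"p": False, "q": False}] * 3   # not covered by P
)


def covers_p(obs):
    """Stand-in for the coverage test of the rule body P."""
    return obs["p"]


def covers_p_and_q(obs):
    """Stand-in for the coverage test of the whole pattern P ∪ Q."""
    return obs["p"] and obs["q"]


def support(pattern, observations):
    """Fraction of spatial observations covered by the pattern."""
    return sum(1 for obs in observations if pattern(obs)) / len(observations)


def confidence(body, body_and_head, observations):
    """Among observations covered by the body, the fraction also covered by body and head."""
    covered_by_body = sum(1 for obs in observations if body(obs))
    if covered_by_body == 0:
        return 0.0
    covered_by_both = sum(1 for obs in observations if body_and_head(obs))
    return covered_by_both / covered_by_body


s = support(covers_p_and_q, OBSERVATIONS)                # 6 of 10 observations
c = confidence(covers_p, covers_p_and_q, OBSERVATIONS)   # 6 of the 7 covered by P
print(f"({s:.0%}, {c:.0%})")
```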

Further details about representation and algorithmic issues can be found in [12]. 



3 An Application to Stockport Census Data 

In some works on spatial representation from the social scientist's perspective, socio- 
economic phenomena have been conceptualized as spatial objects in the sense of 
entities having both spatial location and spatially independent attribute characteristics 
[13]. Population data are among the potentially spatial socioeconomic data. They are 
usually geo-referenced with respect to areal spatial objects such as census zones, 
electoral constituencies, local government areas, or regular grid squares. In the UK, 
for instance, the geo-referencing areal units are ED (enumeration district), Ward,
District, and County. They form a hierarchy based on the inside relationship among
locations. Thus the ED is the smallest unit for which census data are published
nowadays. Furthermore, the digital ED boundaries produced for the 1991 UK census
enable the spatial representation of census data in the computer databases. Generally
speaking, population censuses of the 1990s provided an added impetus to the applica- 
tion of GIS to socioeconomic uses. One of the most interesting topic areas for identi- 
fying potential users of such GIS applications is the public debate over Unitary De- 
velopment Plans (UDP) in the UK. The district chosen for investigation is Stockport, 
one of the ten Metropolitan Districts of Greater Manchester, UK. It is divided into 
twenty-two wards for a total of 589 EDs. The case study is expected to show the 
potential benefit of data mining methods and techniques to one or more potential 
users. In particular, census data are extremely important for policy analysis and, once
geo-referenced and conceptualized as spatial objects with numerical aspatial
properties, supply a good test-bed for SPADA. Thus census data (89 tables, each with 120
attributes on average) and digital ED boundaries have been loaded into an
Oracle-Spatial database, i.e. a relational DBMS extended with spatial data handling facilities.
The ED code allows the joining of the two kinds of data and the generation of test 
data. 

We have focused our attention on transportation planning, which is one of the key
issues in the UDP. Let us suppose that some decision-making process about the
motorway M63 is ongoing. Describing the area of Stockport served by M63 (i.e. the wards
of Brinnington, Cheadle, Edgeley, Heaton Mersey, South Reddish) may be of support
to the planners. In this paper we report the preliminary results obtained by applying
SPADA to the task of discovering multiple-level spatial association rules relating EDs
intersected by the motorway M63 (S) and all EDs in the area served by M63 (R) to be
characterized with respect to data about commuting.
This spatial data mining query raises several application issues for SPADA. First,
census data are available at the ED level. Thus, an is-a hierarchy for the Stockport ED
layer has been obtained by grouping EDs on the basis of the ward they belong to (see
Figure 1) and expressed as Datalog facts in BK. Indeed, the current version of
SPADA deals only with is-a hierarchies where the is-a relationship is overloaded, i.e.
it may stand for kind-of as well as for instance_of depending on the context. Further
is-a hierarchies could be derived by resorting to clustering algorithms.



Fig. 1. An is-a hierarchy for the Stockport ED layer



Second, census data are all numeric (more precisely, integer values). The attributes 
that we have selected for this experiment (see Table 1) refer to residents aged 16 and 
over, thus they have been normalized with respect to the total number of residents 
aged 16 and over (s820001). Each pair of consecutive cut points a and b has
generated an interval of the kind [a..b].
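
The step from cut points to intervals can be illustrated with a short Python sketch. The cut points are those listed for attribute S820221 in Table 1; the function names are assumptions, and how values lying exactly on a cut point are assigned is not specified in the text and is not modelled here.

```python
# Cut points of attribute S820221 as listed in Table 1.
S820221_CUTS = [
    0.0, 2.222, 4.762, 9.091, 10.345, 13.636,
    18.182, 19.355, 21.131, 23.529, 25.0, 28.571,
]


def intervals(cut_points):
    """Pair every cut point with its successor: one (a, b) per consecutive pair."""
    return list(zip(cut_points, cut_points[1:]))


def as_argument(interval):
    """Render an interval in the [a..b] notation used in the patterns."""
    a, b = interval
    return f"[{a}..{b}]"


labels = [as_argument(iv) for iv in intervals(S820221_CUTS)]
print(labels)
```

With n cut points this yields n-1 intervals; the intervals [10.345..13.636] and [19.355..21.131] that occur in the rules reported later are among them.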

Last, some spatial computation is necessary. In particular, the relations of
intersection (EDs-motorways) and adjacency (EDs-EDs) have been extracted for the
area of interest and transformed into Datalog facts of D(S). It is noteworthy that the
relations of accessibility and closeness have been defined by means of spatial
qualitative reasoning:

linked_to(X, Y) ← intersect(X, m63), intersect(Y, m63), Y\=X.
close_to(X, Y) ← adjacent_to(X, Z), adjacent_to(Z, Y), Y\=X.

These rules have been added to BK together with the aforementioned spatial
hierarchies and also the spatial constraint:
ed_on_M63(X) ← intersect(X, m63).
which defines the instances of S.
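
What the rules for linked_to and close_to derive from a set of facts can be mimicked in Python with set comprehensions. The sketch below is an illustration only: the ED identifiers, the second road name and the assumption that adjacency facts are stored in both directions are all invented here.

```python
# Toy facts; identifiers ed1..ed3 and "other_road" are hypothetical.
INTERSECT = {("ed1", "m63"), ("ed3", "m63"), ("ed2", "other_road")}
ADJACENT_TO = {("ed1", "ed2"), ("ed2", "ed1"), ("ed2", "ed3"), ("ed3", "ed2")}


def linked_to(intersect, road="m63"):
    """linked_to(X, Y): X and Y both intersect the road and Y differs from X."""
    on_road = {x for (x, r) in intersect if r == road}
    return {(x, y) for x in on_road for y in on_road if y != x}


def close_to(adjacent_to):
    """close_to(X, Y): X and Y share an adjacent object Z and Y differs from X."""
    return {
        (x, y)
        for (x, z1) in adjacent_to
        for (z2, y) in adjacent_to
        if z1 == z2 and y != x
    }


def ed_on_m63(intersect):
    """ed_on_M63(X): the reference objects, i.e. objects intersecting m63."""
    return {x for (x, r) in intersect if r == "m63"}


print(sorted(linked_to(INTERSECT)), sorted(close_to(ADJACENT_TO)), sorted(ed_on_m63(INTERSECT)))
```

Note that, as the rule is written, two directly adjacent objects are close_to each other only if they also share a common neighbour.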

SPADA has been run on the obtained D(S) with thresholds min_sup[1]=Q.l and
min_conf[1]=0.9 at the first level, and min_sup[2]=0.5 and min_conf[2]=0.8 at the
second level. The whole discovery process has taken 490.21 sec on a PC Pentium III
with 128 Mb RAM (37.84 sec for level 1, and 452.31 sec for level 2). It has returned
744 frequent patterns out of 17619 candidate patterns and 24964 strong rules out of
40465 generated rules. Some interesting patterns have been discovered. For instance,
at level l=2 in the spatial hierarchies, the following candidate P:
ed_on_M63(A), close_to(A,B), is_a(B, south_reddish_ED), linked_to(A,C), C\=B, 
s820161(C, [52.632..54.167]), is_a(C, cheadle_ED) 
has been generated after k=6 refinement steps and has been evaluated with respect to
D(S). Since six of ten spatial observations (|S|=10) are covered and all the ancestor
patterns are large at their level (l<2), the pattern is a large one at level l=2 with 60%
support. For the sake of clarity, the following pattern

ed_on_M63(A), close_to(A,B), is_a(B, ed_in_M63_area), linked_to(A,C), C\=B, 
s820161(C, [52.632..54.167]), is_a(C, ed_in_M63_area) 
is one of the large ancestors for the pattern P. It has been generated after k=6
refinement steps at level l=1 and is supported by 90% of EDs intersected by M63. This way
of taking the taxonomies into account during the pattern discovery process
implements what we refer to as the systematic exploration of the hierarchical structure of
task-relevant geographic layers. Furthermore, the use of both variables and atoms of
the kind \= allows SPADA to distinguish between multiple instances of the same class
of spatial objects (e.g. the class ed_in_M63_area).
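
The level-wise notion of largeness used for P and its ancestor can be sketched as a small recursive check in Python. The supports (60% at level 2, 90% at level 1) and the level-2 threshold 0.5 are taken from the text; the level-1 threshold 0.7 is an assumed placeholder, and the dictionary layout and names are hypothetical.

```python
# minsup per level: level 2 is taken from the text, level 1 is an assumed placeholder.
MINSUP = {1: 0.7, 2: 0.5}

# Each pattern records its level, its support and its ancestor at the previous level.
PATTERNS = {
    "P_ancestor": {"level": 1, "support": 0.9, "parent": None},
    "P": {"level": 2, "support": 0.6, "parent": "P_ancestor"},
}


def is_large(name, patterns, minsup):
    """A pattern is large if it meets minsup at its level and its ancestors are large."""
    pattern = patterns[name]
    if pattern["support"] < minsup[pattern["level"]]:
        return False
    parent = pattern["parent"]
    return parent is None or is_large(parent, patterns, minsup)


print(is_large("P", PATTERNS, MINSUP))
```

If the ancestor failed its own threshold, P would be rejected regardless of its support at level 2, which is what lets a level-wise search discard whole families of specialised patterns.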



Table 1. Numerical attributes in the application to Stockport census data.

Attribute:   S820161
Description: Persons who work out of the district of usual residence and
             drive to work
Cut points:  0.0, 6.25, 8.333, 12.973, 17.241, 19.048, 20.943, 23.529, 25.0,
             25.926, 27.586, 29.032, 29.865, 31.25, 33.333, 34.375, 36.182,
             38.235, 40.0, 42.105, 45.455, 46.667, 48.194, 50.0, 51.515,
             52.632, 54.167, 56.0, 57.143, 58.333, 58.824, 60.0, 60.714,
             61.538, 63.889, 65.217, 66.667, 67.742, 69.565, 71.429, 72.902,
             100.0

Attribute:   S820213
Description: Employees and self-employed who reside in households with 3
             or more cars and drive to work
Cut points:  0.0, 2.222, 15.385, 28.0, 29.521, 31.034, 33.333, 35.068, 37.5,
             38.095, 38.889, 41.043, 42.857, 48.387, 72.727

Attribute:   S820221
Description: Employees and self-employed who reside in households with 3
             or more cars and work out of the district of usual residence
Cut points:  0.0, 2.222, 4.762, 9.091, 10.345, 13.636, 18.182, 19.355,
             21.131, 23.529, 25.0, 28.571




One of the strong rules that have been derived from the frequent pattern P is: 

ed_on_M63(A), close_to(A,B), is_a(B,south_reddish_ED)

→ linked_to(A,C), is_a(C,cheadle_ED), B\=C, s820161(C,[52.632..54.167]) (60%, 100%)
“GIVEN THAT 60% of EDs intersected by M63 are close to a South Reddish ED and 
are linked via M63 to a Cheadle ED where 52-54% residents aged 16 and over work 
out of the district of usual residence and drive to work, IF an ED intersected by M63 
is close to a South Reddish ED THEN WITH CONFIDENCE 100% it is linked via 
M63 to a Cheadle ED where ... ”. 

Other examples of strong rules at the second level are:

ed_on_M63(A), close_to(A,B), s820221(B,[10.345..13.636])

→ linked_to(A,C), is_a(C,brinnington_ED), B\=C (60%, 86%)

“GIVEN THAT 60% of EDs intersected by M63 are close to an ED - where 10-13% 
residents aged 16 and over are employees and self-employed who reside in house- 
holds with 3 or more cars and work out of the district of usual residence - and are 
linked via M63 to a Brinnington ED distinct from the previous one, IF an ED inter- 
sected by M63 is close to an ED where ... THEN WITH CONFIDENCE 86% it is 
linked via M63 to a Brinnington ED distinct from the previous one ”. 
ed_on_M63(A), close_to(A,B), s820221(B,[19.355..21.131])

→ is_a(B,heaton_mersey_ED) (70%, 100%)

“GIVEN THAT 70% of EDs intersected by M63 are close to a Heaton Mersey ED 
where 19-21% residents aged 16 and over are employees and self-employed who 
reside in households with 3 or more cars and work out of the district of usual resi- 
dence IF an ED intersected by M63 is close to an ED where ... THEN WITH 
CONFIDENCE 100% the latter ED belongs to the ward of Heaton Mersey”.

One may wonder whether these frequent patterns and strong rules convey novel 
knowledge and, if so, what kind of knowledge. The evaluation of data
mining results is beyond the scope of this paper. Nevertheless a naive interpretation 
of results in our application might lead us to state that the motorway M63 intersects 
an area of Stockport which is characterized by a high percentage of commuters by car 
who may benefit from some improvement of the road network. 



4 Conclusions and Future Work 

The work presented in this paper reports an ILP application to spatial association rule 
mining. Experimental results obtained by applying the novel special-purpose ILP 
system SPADA to geo-referenced census data of Manchester Stockport show that the 
expressive power of first-order logic enables us to tackle applications that cannot be 
handled by the traditional approach to spatial data mining. Furthermore, DDBs offer
effective representation means for domain knowledge, constraints and qualitative 
reasoning. In particular, we can embed rules for the inference of implicit spatial rela- 
tionships that are too numerous to be either stored in the spatial database or computed 
by computational geometry algorithms. 

For the near future, we plan to address the issues of efficiency and scalability in
SPADA. Particular attention will also be paid to the issue of robustness. Indeed,
data pre-processing in spatial data mining is remarkably error-prone. For instance, the
generation of the predicate close_to is based on the user-defined semantics of the
closeness relation, which should necessarily be approximated. Further work on the
data selection and transformation is expected to give some hints on noise handling in 
this application domain. As for the test on real-world spatial data sets, much work has 
still to be done. In particular, we are interested in experiments with mixed census- 
topographic data because they show that the interpretation of spatial relations can 
change as spatial objects are added. 
Abstract. The covering test intensively used in Inductive Logic
Programming, i.e. θ-subsumption, is formally equivalent to a Constraint
Satisfaction Problem (CSP). This paper presents a general reformulation
of θ-subsumption into a binary CSP, and a new θ-subsumption algorithm,
termed Django, which combines some mainstream CSP heuristics
and other heuristics specifically designed for θ-subsumption.

Django is evaluated according to CSP standards, shifting from a worst-case
complexity perspective to a statistical framework, centered on the notion
of Phase Transition (PT). In the PT region lie the hardest on average
CSP instances; and this region has been shown to be of utmost relevance to
ILP [4]. Experiments on artificial θ-subsumption problems designed to
illustrate the phase transition phenomenon show that Django is faster
by several orders of magnitude than previous θ-subsumption algorithms,
within and outside the PT region.



1 Introduction 

Supervised learning relies intensively on the generality operator, or covering
test, which determines whether a given hypothesis covers a given example. As the
evaluation of a candidate hypothesis depends on its coverage, the covering test
must be efficient.

The complexity of the covering test is one main concern facing Inductive
Logic Programming (ILP) [10,12]. The covering test commonly used in ILP,
i.e. θ-subsumption [?], is exponentially complex in the size of the candidate
hypothesis. How to manage this complexity has motivated numerous studies on
learning biases, restricting the size and/or the number of hypotheses explored
through syntactic or search biases [?]. In parallel, new algorithms for achieving
efficient θ-subsumption [8,18] and ILP learners based on a correct approximation
of θ-subsumption [?] have been proposed.

In this paper, we present a new correct and complete θ-subsumption algorithm
termed Django, based on a Constraint Satisfaction Problem (CSP) approach.

Although it has long been known that θ-subsumption is equivalent to a Constraint
Satisfaction Problem (CSP), ILP problems have only recently been put in a
CSP perspective [1,3,4]. The focus is thereby shifted from a worst-case
complexity analysis to a statistical approach [?].

The covering test complexity is handled as a random variable, measuring
the computational cost of θ-subsumption for some order parameters (e.g. the
number of variables and predicates in the hypothesis, the number of literals and
constants in the example). Surprisingly, the computational cost is almost zero
for most problems, referred to as trivial. For instance, assuming that all
predicates in the hypotheses also appear in the examples, a short hypothesis will
almost surely cover all examples; conversely, a long hypothesis will almost
surely cover no example at all. In both cases, the θ-subsumption cost remains low
as the θ-subsumption problem corresponds to an under- or over-constrained
satisfaction problem. But in a narrow region, termed phase transition (PT), where
the probability for a hypothesis to cover an example is close to 50%, the covering
test reaches its maximum complexity on average [3].

The PT phenomenon is of utmost importance for ILP, for two reasons. First,
there is ample evidence of phase transition in artificial problems statistically
modeled from real-world ILP problems [3]. Second, intensive experiments
on artificial problems have shown that this region behaves as an attractor for
existing ILP learners¹ [1].

This paper is concerned with designing a θ-subsumption algorithm with
good average performance on the most relevant and critical instances of
θ-subsumption problems, i.e. those lying within the PT. To this aim, we first
present a general transformation of a θ-subsumption problem, referred to as the
primal problem, into another constraint satisfaction problem, termed the dual
problem. In this transformation, each literal (involved in the hypothesis, primal
CSP) becomes a constrained variable in the dual CSP; conversely, a variable in the
primal CSP gives rise to a set of dual constraints; furthermore, specific
constraints encoding the θ-subsumption structure are automatically generated. A
combination of well-known CSP algorithms, forming the Django system, is applied
to the dual CSP. The approach is validated on artificial θ-subsumption problems
designed to sample the Phase Transition region. Intensive experiments show that
Django improves by several orders of magnitude on average on all problems within
and outside the PT in the considered range, compared to previous θ-subsumption
algorithms [8,18].

The paper is organized as follows. The next section briefly introduces
θ-subsumption and reviews existing θ-subsumption algorithms [8,18]. Section 3
presents the Constraint Satisfaction framework and the main heuristics used to
solve CSPs. Section 4 describes the transformation of a θ-subsumption problem into
a dual binary CSP, and presents the combinations of CSP heuristics involved in
the Django system. The experimental setting and results are reported and discussed
in Section 5, and the paper ends with some perspectives for further research.

¹ E.g. for almost all target concepts, FOIL [?] selects its final hypotheses in
the PT region.

2 θ-Subsumption, Definition and Algorithms

Let hypothesis C denote a conjunction of literals with no function symbols, and
arg(C) denote the set of variables in C. Example Ex likewise is a conjunction
of literals with no function symbols, arg(Ex) being the set of variables and
constants in Ex.

By definition, C θ-subsumes Ex according to θ iff θ is a mapping from arg(C)
to arg(Ex), mapping variables in C onto variables and constants in Ex, such
that Cθ is included in Ex. Instead of mapping variables in C onto variables and
constants in Ex, it is often more computationally efficient to map literals in C
onto those literals in Ex built on the same predicate symbol. Through a literal
mapping, each variable in C is associated to a set of variables and constants in
Ex; the literal mapping is termed consistent if it maps each variable in C onto a
single variable or constant in Ex.

The mainstream algorithm for θ-subsumption is based on Prolog SLD resolution
[?]. It performs a depth-first exploration of literal mappings (associating
to the first literal in C the first literal built on the same symbol in Ex, and so
on), and it backtracks if an inconsistency occurs (e.g. one variable in C is
associated to two constants in Ex). Literals in C and Ex are explored in their
order of appearance, which has a significant impact on the SLD efficiency, as
known by all Prolog programmers.
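
The depth-first exploration of literal mappings can be illustrated by the
following minimal Python sketch. It is not the Prolog implementation referred to
in the paper: the function name `theta_subsumes` and the encoding of a literal as
a `(predicate, arguments)` tuple are assumptions made for illustration, and the
clause and example are the toy ones of Section 4.1.

```python
def theta_subsumes(clause, example, theta=None):
    """Depth-first search for a substitution theta such that every literal of
    `clause`, once instantiated by theta, occurs in `example`.
    Returns the substitution as a dict, or None if there is none."""
    theta = {} if theta is None else theta
    if not clause:
        return theta
    (pred, args), rest = clause[0], clause[1:]
    for ex_pred, ex_args in example:          # candidates in order of appearance
        if ex_pred != pred or len(ex_args) != len(args):
            continue
        extended = dict(theta)
        # Consistency: a variable may not be bound to two different terms.
        if all(extended.setdefault(v, t) == t for v, t in zip(args, ex_args)):
            found = theta_subsumes(rest, example, extended)
            if found is not None:
                return found
    return None                               # all candidates failed: backtrack


# Toy clause and example (literal = (predicate symbol, tuple of arguments)).
C = [("tc", ("X0",)), ("p", ("X0", "X1")), ("p", ("X1", "X2")),
     ("p", ("X2", "X3")), ("q", ("X0", "X2", "X3"))]
EX = [("tc", ("a0",)), ("p", ("a0", "a1")), ("p", ("a1", "a2")),
      ("p", ("a2", "a3")), ("p", ("a3", "a4")), ("p", ("a0", "a3")),
      ("q", ("a0", "a2", "a3")), ("q", ("a0", "a1", "a3"))]

print(theta_subsumes(C, EX))
```

As in SLD resolution, the cost of this sketch depends heavily on the order in
which the literals of the clause and of the example are listed.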

A first improvement has been proposed by [8], based on the notion of
determinate matching. It consists of reordering the literals in
$C = p_1, \ldots, p_K$ in such a way that, if possible, there is a single
candidate literal $p'_i = \theta(p_i)$ in Ex for $p_i$ in C which is consistent
with the previous assignments (such that
$\{p_1/\theta(p_1), \ldots, p_i/\theta(p_i)\}$ is consistent). After all
determinate literals in C have been mapped onto literals in Ex, and if necessary,
the search resumes using SLD resolution.
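
A rough sketch of the determinate matching idea is given below, under the
assumption that literals are encoded as `(predicate, arguments)` tuples. The
helper names `consistent_bindings` and `determinate_matching` and the small
example are hypothetical; the sketch does not reproduce the algorithm of [8], it
only shows the repeated selection of literals that have a single consistent
candidate.

```python
def consistent_bindings(literal, example, theta):
    """All extensions of theta obtained by mapping `literal` onto a literal of
    `example` built on the same predicate symbol."""
    pred, args = literal
    out = []
    for ex_pred, ex_args in example:
        if ex_pred == pred and len(ex_args) == len(args):
            extended = dict(theta)
            if all(extended.setdefault(v, t) == t for v, t in zip(args, ex_args)):
                out.append(extended)
    return out


def determinate_matching(clause, example):
    """Map literals with exactly one consistent candidate, until none is left.
    Returns (theta, remaining literals), or None if some literal has no
    candidate at all (the clause then cannot subsume the example)."""
    theta, remaining = {}, list(clause)
    progress = True
    while progress:
        progress = False
        for literal in remaining:
            options = consistent_bindings(literal, example, theta)
            if not options:
                return None
            if len(options) == 1:             # determinate literal
                theta = options[0]
                remaining.remove(literal)
                progress = True
                break
    return theta, remaining


# Hypothetical example: q(Y,Z) is determinate at once, which in turn makes
# p(X,Y) and r(Z) determinate.
clause = [("p", ("X", "Y")), ("q", ("Y", "Z")), ("r", ("Z",))]
example = [("p", ("a", "b")), ("p", ("a", "c")), ("q", ("c", "d")),
           ("r", ("d",)), ("r", ("e",))]
print(determinate_matching(clause, example))
```

Literals left in the second component of the result are the ones for which a
search procedure is still needed.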

The scope of determinate matching is extended by [18], using a graph context
to prune the candidate literals. To each literal p in C (resp. in Ex) is associated
its neighborhood; the 1-neighbors of p are all literals sharing at least one
variable (resp. one variable or one constant) with p; the i-th neighbors are
recursively constructed, as 1-neighbors of (i-1)-neighbors. It is shown that,
unless all predicate symbols occurring in the i-th neighbors of p also occur among
the i-th neighbors of p', p cannot be mapped onto p', and the latter can be removed
from the candidate literals for p. [18] further define a substitution graph,
connecting two pairs of literals (p,p') and (q,q') iff the mapping {p/p', q/q'} is
consistent. The SLD search is replaced by a maximal clique search in the
substitution graph. The worst-case complexity remains exponential, but the
advantage is that the consistency check is performed only once.

Another heuristic used by [8] proceeds by decomposing the substitution
graph into mutually independent components (k-locality). Such a decomposition
significantly reduces the complexity of the problem.



θ-Subsumption in a Constraint Satisfaction Perspective



3 Constraint Satisfaction Problem 

This section briefly introduces CSPs together with the mainstream heuristics;
the reader is referred to [21] for a comprehensive presentation.

A CSP involves i) a set of variables $X_1, \ldots, X_n$, with $dom(X_i)$ being
the value domain for $X_i$, and ii) a set of constraints, specifying the
simultaneously admissible values of the variables. A constraint can conveniently
be thought of as a predicate $r(X_{i_1}, \ldots, X_{i_k})$, while the admissible
values are described as a set of literals $r(a_{i_1}, \ldots, a_{i_k})$, with
$a_{i_j} \in dom(X_{i_j})$. The constraint scope, noted $arg(r)$, is the set of
variables $X_{i_1}, \ldots, X_{i_k}$. The constraint domain, noted $dom(r)$, is
the set of literals built on r².

A CSP solution assigns to each variable $X_i$ a value in $dom(X_i)$ such that
all constraints are satisfied; it can be viewed as a mapping
$\theta = \{X_i/a_i\}$ such that for each constraint
$r(X_{i_1}, \ldots, X_{i_k})$, $r\theta = r(a_{i_1}, \ldots, a_{i_k})$ belongs to
$dom(r)$. In other words, the CSP defined by constraints $r_1, \ldots, r_K$ is
satisfiable iff $r_1, \ldots, r_K$ θ-subsumes the conjunction
$dom(r_1), \ldots, dom(r_K)$. Likewise, the CSP complexity is exponential in the
number n of variables, and linear in the number m of constraints: if $|a|$ is the
number of possible values for a variable, the complexity is
$O(|a|^n \times m)$. (A first way of decreasing the complexity is by decomposing
the CSP into loosely related subproblems - hierarchizing the set of variables [?]
or the set of constraints [2] - in the same spirit as k-locality [8].)

Two CSPs are equivalent iff they are defined on the same variables and admit
the same solutions. As any CSP can be embedded into a binary CSP, i.e. with
binary constraints only, most CSP algorithms only consider binary and unary
constraints. Further, with no loss of generality, one assumes that there exists at
most one constraint on each variable pair.

CSP algorithms are made up of two kinds of heuristics. Reduction heuristics 
are meant to transform a CSP into an equivalent CSP of lesser complexity, 
through reducing the variable domains. Search heuristics are concerned with the 
search strategy. 

3.1 Reduction 

Reduction proceeds by pruning the candidate values for each variable X. Value
a in dom(X) is locally consistent if, for all variables Y such that there exists
a constraint r(X,Y), there exists some candidate value b in dom(Y) such that
r(a, b) holds (belongs to dom(r)). Clearly, if a is not locally consistent, X
cannot be mapped onto a, which can thus soundly be removed from dom(X).
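
The removal of locally inconsistent values can be sketched as follows. This is a
plain AC-3 style loop, chosen for brevity rather than for optimal complexity; the
names `revise` and `arc_consistency`, the representation of a constraint as a set
of admissible value pairs, and the example X < Y are all assumptions made for
illustration.

```python
from collections import deque


def revise(domains, allowed, x, y):
    """Remove from dom(x) the values with no support in dom(y).
    Returns True if dom(x) was reduced."""
    unsupported = {a for a in domains[x]
                   if not any((a, b) in allowed[(x, y)] for b in domains[y])}
    domains[x] -= unsupported
    return bool(unsupported)


def arc_consistency(domains, allowed):
    """AC-3 style reduction. `allowed` maps each directed arc (x, y) to the set
    of admissible value pairs (a, b); both directions must be present.
    Domains are pruned in place; returns False if some domain becomes empty."""
    queue = deque(allowed)
    while queue:
        x, y = queue.popleft()
        if revise(domains, allowed, x, y):
            if not domains[x]:
                return False
            # dom(x) shrank: arcs pointing at x must be re-examined.
            queue.extend(arc for arc in allowed if arc[1] == x and arc[0] != y)
    return True


# Hypothetical example: X < Y with both variables ranging over {1, 2, 3}.
less = {(a, b) for a in range(1, 4) for b in range(1, 4) if a < b}
allowed = {("X", "Y"): less, ("Y", "X"): {(b, a) for a, b in less}}
domains = {"X": {1, 2, 3}, "Y": {1, 2, 3}}
arc_consistency(domains, allowed)
print(domains)
```

In this example, value 3 is pruned from dom(X) and value 1 from dom(Y), since
neither has a supporting value in the other domain.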

Local consistency is extended as follows: a is k-consistent with X if for each
set of constraints $r_1(X, Y_1), r_2(Y_1, Y_2), \ldots, r_{k-1}(Y_{k-2}, Y_{k-1})$,
there exists a (k-1)-tuple $(b_1, \ldots, b_{k-1})$ such that
$r_1(a, b_1), r_2(b_1, b_2), \ldots, r_{k-1}(b_{k-2}, b_{k-1})$ holds.

A CSP is k-consistent iff each value domain $dom(X_i)$ includes k-consistent
values only - and is not empty³. Checking k-consistency is exponentially complex
with respect to k; therefore, only 2-consistency, or arc consistency, is used in
practice. The best complexity of reduction algorithms is $O(m|a|^2)$, with m being
the number of constraints and $|a|$ the value domain size.

² In all generality, the constraint domain can be infinite (e.g. a numerical
constraint on real-valued variables). Only finite domains are considered in the
rest of the paper.

3.2 Search 

CSP algorithms incrementally construct a solution $\{X_i/a_i\}$ through a
depth-first exploration of the substitution tree; a node corresponds to a variable
$X_i$, to which is assigned some candidate value $a_i$. On each assignment,
consistency is checked; on failure, another candidate value for the current node
is considered; if no other value is available, the search backtracks.

Several approaches have been proposed in order to improve: (i) the backtracking
procedure (look-back heuristics); (ii) the choice of the next variable and
candidate value to consider (look-ahead heuristics).

Look-back heuristics aim at preventing the repeated exploration of the same
substitution subtree on backtracking (thrashing). For instance, Conflict-directed
Backjumping (CBJ) [?] records all variable conflicts that occurred during the
exploration, which allows for backtracking directly to the appropriate tree level.
On the other hand, it may happen that the overhead due to maintaining the conflict
registers offsets the look-back advantages for some particular CSP instances.

Look-ahead heuristics aim at minimizing the number of assignments considered.
The best known look-ahead heuristic is constraint propagation; in each
step, the candidate values which are inconsistent with the current assignment
are pruned. This way, inconsistencies are detected earlier and fewer nodes are
visited; in return, the assignment operation becomes more expensive as it
involves the constraint propagation step.

Forward checking (FC) employs a limited propagation, only pruning the candidate
values for the next variable (partial arc-consistency). Maintaining arc
consistency (MAC) checks the arc-consistency on each variable assignment. Again,
the overhead due to constraint propagation might offset its advantages on
medium-size weakly constrained CSPs. Currently, the most generally efficient
algorithms combine FC and CBJ.

In addition, the variable order can be optimized, either statically (once and for
all), or dynamically (the yet unassigned variables are reordered on each
assignment). Dynamic variable ordering is generally more efficient than static
variable ordering. One criterion for reordering the variables is based on the
First Fail Principle [?], preferring the variable with the smallest domain. This
way, failures will occur sooner rather than later.
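
The combination of forward checking with a min-domain dynamic variable ordering
can be sketched as below. This is an illustrative solver for binary CSPs, not the
Django implementation: the function names, the representation of constraints as
sets of admissible pairs, and the n-queens test problem are assumptions. Note
that the sketch prunes the domains of all unassigned variables linked to the
current one, which is the usual textbook form of forward checking.

```python
def solve(domains, allowed, assignment=None):
    """Backtracking search with forward checking and a dynamic min-domain
    variable ordering. `allowed[(x, y)]` is the set of admissible (a, b) pairs
    and must be given for both directions of every constraint."""
    assignment = {} if assignment is None else assignment
    unassigned = [v for v in domains if v not in assignment]
    if not unassigned:
        return assignment
    # First Fail Principle: branch on the variable with the smallest domain.
    var = min(unassigned, key=lambda v: len(domains[v]))
    for value in sorted(domains[var]):
        pruned = dict(domains)
        pruned[var] = {value}
        for other in unassigned:
            if other != var and (var, other) in allowed:
                # Forward checking: keep only values compatible with var=value.
                pruned[other] = {b for b in domains[other]
                                 if (value, b) in allowed[(var, other)]}
        if all(pruned[v] for v in unassigned):
            found = solve(pruned, allowed, {**assignment, var: value})
            if found is not None:
                return found
    return None


def queens(n):
    """Hypothetical test problem: n-queens, one variable per row."""
    domains = {i: set(range(n)) for i in range(n)}
    allowed = {(i, j): {(a, b) for a in range(n) for b in range(n)
                        if a != b and abs(a - b) != abs(i - j)}
               for i in range(n) for j in range(n) if i != j}
    return domains, allowed


print(solve(*queens(6)))
```

Because every assignment filters the domains of the remaining variables, a
complete assignment returned by `solve` satisfies all constraints without any
final check.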

Last, the candidate values can be ordered too; the value with the fewest conflicts
with the other variable domains is commonly preferred.

³ The use of graph contexts to prune the candidate literals [18] can be viewed as
a k-consistency check (more on this in section

4 CSP Heuristics for θ-Subsumption

This section formalizes θ-subsumption as a binary CSP, and presents a
new combination of CSP heuristics for θ-subsumption, Django.

4.1 Representation

It has been shown (Section 3) that a CSP is equivalent to a θ-subsumption
problem. However, θ-subsumption generally considers n-ary predicates. An ad
hoc representation is thus necessary to enable the use of standard CSP heuristics.

$C : tc(X_0), p(X_0,X_1), p(X_1,X_2), p(X_2,X_3), q(X_0,X_2,X_3)$

$Ex : tc(a_0), p(a_0,a_1), p(a_1,a_2), p(a_2,a_3), p(a_3,a_4), p(a_0,a_3), q(a_0,a_2,a_3), q(a_0,a_1,a_3)$

We choose to consider the dual constraint satisfaction problem defined as
follows. Each dual variable $Y_{p,i}$ corresponds to a literal in C, namely the
i-th literal built on predicate symbol p (subscript i will be omitted for
readability when there is a single literal built on the predicate symbol); its
domain $dom(Y_{p,i})$ is the set of all literals in Ex built on the same predicate
symbol p, e.g.
$dom(Y_{p,1}) = \{p(a_0,a_1), p(a_1,a_2), p(a_2,a_3), p(a_3,a_4), p(a_0,a_3)\}$.

A dual constraint $r(Y_{p,i}, Y_{q,j})$ is set on a variable pair
$(Y_{p,i}, Y_{q,j})$ iff the corresponding literals in C share a (primal)
variable; for instance, as $tc(X_0)$ and $p(X_0,X_1)$ share variable $X_0$, there
is a dual constraint linking $Y_{tc}$ and $Y_{p,1}$. This constraint specifies
that, for each literal p' in $dom(Y_{p,i})$, there must be a literal q' in
$dom(Y_{q,j})$ such that the literal mapping $\{p_i/p', q_j/q'\}$ is consistent
with respect to θ-subsumption. In our toy example, dual constraint
$r(Y_{tc}, Y_{p,1})$ is only satisfied for the dual value pair
$(tc(a_0), p(a_0,a_1))$.

The difference between such dual constraints and the substitution graph in [18]
is that the substitution graph specifies whether a given literal assignment {p/p'}
is consistent with another one {q/q'}. In contrast, dual constraints require that,
for each pair of literals p, q in C sharing one variable, there exists a pair p', q'
of literals in Ex such that {p/p', q/q'} is consistent.
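
To make the dual representation concrete, the following sketch builds the dual
domains and the dual constraints for the toy clause and example. The function
names `consistent` and `dual_csp`, the use of literal indices as dual variables,
and the tuple encoding of literals are assumptions made for illustration; the
sketch only enumerates pairwise-consistent value pairs and does not include the
signature-based pruning.

```python
from itertools import combinations


def consistent(lit_p, img_p, lit_q, img_q):
    """True if mapping lit_p onto img_p and lit_q onto img_q binds every
    shared (primal) variable to a single term."""
    binding = {}
    pairs = list(zip(lit_p[1], img_p[1])) + list(zip(lit_q[1], img_q[1]))
    return all(binding.setdefault(var, term) == term for var, term in pairs)


def dual_csp(clause, example):
    """Dual variables are the indices of the literals of `clause`; the domain
    of a dual variable is the list of example literals built on the same
    predicate; a dual constraint links two literals sharing a variable."""
    domains = {i: [e for e in example
                   if e[0] == lit[0] and len(e[1]) == len(lit[1])]
               for i, lit in enumerate(clause)}
    constraints = {}
    for i, j in combinations(range(len(clause)), 2):
        if set(clause[i][1]) & set(clause[j][1]):
            constraints[(i, j)] = {(a, b) for a in domains[i] for b in domains[j]
                                   if consistent(clause[i], a, clause[j], b)}
    return domains, constraints


C = [("tc", ("X0",)), ("p", ("X0", "X1")), ("p", ("X1", "X2")),
     ("p", ("X2", "X3")), ("q", ("X0", "X2", "X3"))]
EX = [("tc", ("a0",)), ("p", ("a0", "a1")), ("p", ("a1", "a2")),
      ("p", ("a2", "a3")), ("p", ("a3", "a4")), ("p", ("a0", "a3")),
      ("q", ("a0", "a2", "a3")), ("q", ("a0", "a1", "a3"))]

domains, constraints = dual_csp(C, EX)
print(len(domains[1]), sorted(constraints))
```

Each dual variable built on p starts with the five p-literals of Ex as its
domain, and seven pairs of literals of C share at least one variable.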

The dual CSP is enriched by associating to each dual variable (literal in
C) and candidate value (literal in Ex) a signature, encoding the literal links
(shared variables) with all other literals. For instance, the signature associated
to $p(X_0,X_1)$ states that the first variable appears in a literal built on
symbol tc, position 1, and a literal built on p, position 1; and the second
variable appears in a literal built on symbol p, positions 1 and 2. Signatures
allow one to prune the candidate literals through arc-consistency, in a similar
way to graph contexts [18]: the signature of the literal in C must be included in
the signature of the candidate literal in Ex. The difference is that signatures
are deliberately limited to depth 1 (only 1-neighborhoods are considered), which
allows for an optimized implementation.
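
One possible encoding of signatures is sketched below. It is only an assumption
about how such a depth-1 signature could be represented (a set of
(predicate, position) occurrences per argument, held in Python frozensets); the
paper's own implementation is not described in enough detail to be reproduced
here, and the function names are hypothetical.

```python
def signature(literal, literals):
    """For each argument of `literal`, the set of (predicate, position) pairs
    at which this argument occurs among `literals` (the literal included)."""
    return tuple(
        frozenset((pred, pos) for pred, args in literals
                  for pos, term in enumerate(args, start=1) if term == arg)
        for arg in literal[1])


def prune_by_signature(clause, example):
    """Keep a candidate literal of `example` for a literal of `clause` only if
    the clause-side signature is included, argument by argument, in the
    candidate's signature."""
    domains = {}
    for lit in clause:
        sig = signature(lit, clause)
        domains[lit] = [e for e in example
                        if e[0] == lit[0] and len(e[1]) == len(lit[1])
                        and all(s <= t for s, t in zip(sig, signature(e, example)))]
    return domains


C = [("tc", ("X0",)), ("p", ("X0", "X1")), ("p", ("X1", "X2")),
     ("p", ("X2", "X3")), ("q", ("X0", "X2", "X3"))]
EX = [("tc", ("a0",)), ("p", ("a0", "a1")), ("p", ("a1", "a2")),
      ("p", ("a2", "a3")), ("p", ("a3", "a4")), ("p", ("a0", "a3")),
      ("q", ("a0", "a2", "a3")), ("q", ("a0", "a1", "a3"))]

for lit, candidates in prune_by_signature(C, EX).items():
    print(lit, "->", candidates)
```

On the toy example, this inclusion test alone reduces the five candidate
p-literals to one or two candidates per literal of C, before any search.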

Last, the case of literals sharing several variables is considered; signatures
associated to pairs of such literals, termed 2-signatures, are designed and used
to prune candidate literals too.

4.2 Resolution

As mentioned earlier, there is no such thing as a universally efficient CSP
heuristic; it is thus desirable to evaluate carefully how relevant a given CSP
heuristic is with respect to θ-subsumption problems. Several combinations of
heuristics have been tested in Django (summarized in Table 1).

The baseline version V1 combines arc-consistency checking and forward
checking (the propagation of the current assignment is restricted to the next
variable domain).



Table 1. Django, Versions V1 to V8.

Baseline
  V1  Arc consistency + simple Forward Checking
      (propagation of the current assignment wrt the next variable)

Dynamic Variable Ordering
  V2  V1 + DVO based on minimal domain
      (random choice in case of tie)
  V3  V1 + DVO based on maximal connectivity
      (random choice in case of tie)
  V4  V1 + DVO based on min. domain + max. connectivity
      (minimal domain, then maximal connectivity)
  V5  V1 + DVO based on max. connectivity + min. domain
      (maximal connectivity, then minimal domain)

Forward Checking
  V6  V4 + improved Forward Checking
      (propagation of forced assignments, i.e. singleton candidate values)

Arc Consistency
  V7  V6 + AC based on signatures
  V8  V7 + AC based on signatures and 2-signatures



We first investigate the influence of variable ordering on the search efficiency.
Versions V2 to V5 implement several dynamic variable orderings, all based on
the First Fail Principle. In V2, variables with minimal domain are ranked first⁴.
In V3, variables subject to a maximal number of dual constraints are ranked
first (prefer the literals in C which are most connected to other literals). Both
criteria are combined in versions V4 and V5, with different priorities.

Secondly, we investigate the influence of Forward Checking. In Version V6,
besides the 1-step propagation of the current assignment, forced assignments
(singleton candidate value for any variable) are propagated.

Last, we investigate the influence of arc consistency, using signatures and
2-signatures. Version V7 differs from Version V6 in that it considers the literal
signatures; version V8 considers both signatures and 2-signatures.

⁴ Note that the determinate matching heuristic in the primal θ-subsumption problem
[8] corresponds to a particular case of the minimal domain heuristic with regard
to the dual CSP.

5 Experimental Validation

5.1 Experimental Setting

As mentioned earlier, CSP algorithms are mainly tested in the PT region, which
concentrates the problems that are hardest on average.

Following [3], artificial data were constructed to examine the algorithm
behavior within and outside the PT region. Artificial θ-subsumption problems
(pairs (hypothesis C, example Ex)) are constructed from four order parameters:
the number n of variables in C; the number m of literals in C, all built
on distinct predicate symbols; the number N of literals built on each predicate
symbol in Ex; the number L of constants in Ex.

In order to keep the total computational cost within reasonable limits, n is
set to 10, N is set to 100, m varies in [10,50] and L varies in [10,50]. For each
pair (m, L), 1,000 pairs (hypothesis C, example Ex) are constructed with random
uniform distribution [3], and cost(m, L) is reported as the average θ-subsumption
cost over all 1,000 trials, measured in seconds (Django is implemented in
and runs on a PC Pentium2). All Django versions are tested and compared
with three θ-subsumption reference algorithms, respectively SLD Prolog,
determinate matching [8] and graph contexts [18]; in the latter cases, we used
the algorithm implementation kindly given by T. Scheffer.
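
A simplified generator driven by the four order parameters might look as follows.
It is a sketch under stated assumptions, not the generator of [3]: predicates are
taken to be binary (as in Section 5.3), arguments are drawn uniformly with
replacement, and neither clause connectivity nor the distinctness of example
literals is enforced. The function name `random_problem` and the symbol naming
scheme are hypothetical.

```python
import random


def random_problem(n, m, N, L, rng):
    """Draw one (clause, example) pair from the order parameters:
    n variables and m binary literals (distinct predicates) in the clause,
    N literals per predicate and L constants in the example."""
    variables = [f"X{i}" for i in range(n)]
    constants = [f"a{i}" for i in range(L)]
    clause = [(f"p{k}", tuple(rng.choices(variables, k=2))) for k in range(m)]
    example = [(f"p{k}", tuple(rng.choices(constants, k=2)))
               for k in range(m) for _ in range(N)]
    return clause, example


# Parameter values in the range used in the experiments (n=10, N=100).
clause, example = random_problem(n=10, m=20, N=100, L=30, rng=random.Random(0))
print(len(clause), len(example))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible,
which matters when averaging costs over many trials per (m, L) pair.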

5.2 Results and Discussion 

As might have been expected, Prolog SLD does not keep up when hypothesis
C involves more than a few literals, and it had to be stopped for m > 5 (recall
that example Ex involves 100 x m literals). Determinate matching [8]
does significantly better than SLD for small-size hypotheses; however, it runs
out of resources for m > 10; in retrospect, this heuristic is poorly suited to the
random structure of the examples.




























Fig. 1. θ-subsumption cost(m, L) for Django.V1, averaged on 1,000 pairs (C, Ex),
with m (number of predicates in C) in [10,50] and L (number of constants in Ex)
in [10,50]

Graph contexts also turned out to be hardly applicable, mostly for efficiency
reasons (see below); finally only the maximal clique search (MCS) [18] was
tested in the same range as Django.

The behavior of each algorithm is conveniently pictured as the surface
cost(m,L). Fig. 1 displays the cost landscape obtained for the baseline version
of Django.

In a CSP perspective [?], three complexity regions are distinguished. The PT
region appears as a mountain chain of hyperbolic shape in the (m, L) plane; it
concentrates the θ-subsumption problems that are hardest on average.

The YES region, before the PT (for low values of m or L), contains trivial
(hypothesis, example) pairs, where the hypothesis almost surely subsumes the
example; in this region typically lie overly general hypotheses, which are
complete and incorrect with respect to the dataset.

The NO region, beyond the PT, contains trivial (hypothesis, example) pairs,
where the hypothesis almost never subsumes the example; in this region lie the
hypotheses covering no training examples, which are thus found to be correct.

The cost landscape obtained for MCS [18] is depicted in Fig. 2 (higher costs
by a factor of 6 compared to Fig. 1). Interestingly, the phase transition region
is larger than for Django.V1. Note that the complexity is not negligible in the
NO region. This suggests that MCS does not detect inconsistencies early, and
explores the substitution graph unnecessarily in the NO region.

On the other hand, the first step of MCS [18] is the construction of the
whole substitution graph, which is exploited in the second step by a maximal
clique search. This first step is computationally heavy; it represents a
significant amount of the total cost, unless the problem size is large. For
large-size problems, the graph construction effort is negligible compared to the
maximal clique search, and worthwhile as it significantly speeds up the maximal
clique search. Unfortunately, the memory resources needed to store the
substitution graph for large problems are hardly tractable; no 2-graph contexts
could be used with MCS for m > 10.

Fig. 2. θ-subsumption cost(m, L) for MCS, averaged on 1,000 pairs (C, Ex), with
m (number of predicates in C) in [10,38] and L (number of constants in Ex) in
[10,50] (scale factor x 6 compared to Fig. 1)

In contrast, Django interleaves the search and the constraint propagation; 
this way, the construction of the whole graph is avoided whenever a solution 
might be found along the search. 

The θ-subsumption costs (with multiplicative factor 100) are summarized in
Table 2, averaged over all three regions.



Table 2. Average θ-subsumption cost (x100) in the YES, NO and PT regions.

             YES region          Phase Transition      NO region
             cost      ±         cost      ±           cost      ±
MCS          344.69    589.93    841.53    1325.83     800.74    908.22
Django.V1     20.25     32.80    116.83     142.01       6.67     17.25
V2             4.18      5.97      4.99       6.92       2.19      2.71
V3             4.80      6.70      8.79      11.84       2.44      3.33
V4             4.25      6.09      4.56       6.45       2.22      2.75
V5             4.51      6.44      6.70       9.77       2.33      3.09
V6             4.24      5.86      4.79       6.32       1.98      2.45
V7             2.20      3.15      3.53       4.55       1.58      1.93
V8             2.48      3.78      3.15       4.40       0.95      1.41

Cost(m,L) is counted in the YES, PT or NO region, depending on the
fraction f of clauses C subsuming examples Ex, over all pairs (C,Ex) generated
to estimate cost(m,L) (with YES region =def [f > 90%]; PT region =def
[f ∈ [10%, 90%]]; NO region =def [f < 10%]).
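
The region assignment above amounts to a simple thresholding of the coverage
fraction f, as in the following sketch (the function names are hypothetical, and
outcomes are assumed to be recorded as booleans, one per generated pair).

```python
def coverage_fraction(outcomes):
    """Fraction f of (C, Ex) pairs for which C subsumes Ex."""
    return sum(outcomes) / len(outcomes)


def region(f):
    """Region of an (m, L) cell, from its coverage fraction f."""
    if f > 0.90:
        return "YES"
    if f < 0.10:
        return "NO"
    return "PT"


# Hypothetical outcomes for one (m, L) cell: 3 successes out of 8 trials.
outcomes = [True, False, False, True, False, True, False, False]
print(region(coverage_fraction(outcomes)))
```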

Some care must be exercised when interpreting the results, due to the high
variability of the measures; this variability was hardly reduced by increasing the
number of experiments for a given pair (m,L) from 100 to 1,000 trials.

This variability is explained by the fact that only "simple" Forward Checking and
Arc-Consistency heuristics were considered. Though these heuristics are very
efficient on average, they do not cope well with "pathological" cases, which
considerably increases the average resolution cost. Further experiments will
consider the use of Maintained Arc Consistency heuristics, and see whether the
gain achieved on hard θ-subsumption problems (as MAC is optimal with regard to
worst-case complexity) compensates for the loss on easier problem instances.

The phase transition phenomenon is most marked for Django.V1 (Fig. 1),
the cost in the PT region being 10 times the cost in the YES region and 20 times
the cost in the NO region. Note that the performance gain of Django.V1 over
MCS is not uniform; the gain factor is about 30 in the YES region, 7 in the PT,
and 200 in the NO region.

The addition of dynamic variable ordering heuristics visibly improves the
performance, especially in the PT region, smoothing the complexity peak.
Other heuristics, especially signature-based heuristics, also seem to contribute
to the global efficiency⁵. The best gain factor compared to MCS is about 100 in
the YES region, 200 in the PT region, and 700 in the NO region (Fig. 3).




Fig. 3. θ-subsumption cost(m, L) for Django.V8, averaged on 1,000 pairs (C, Ex)
(scale factor 25 compared to Fig. 1)



Similar gain factors have been obtained for experiments (not shown for space
limitations) on the N-queens problem, for N = 10..30.

Last, artificial problems derived from the real-world Mutagenesis problem
[?] have been used to compare Django and MCS. Each hypothesis C considered
involves m literals and n variables, where m and n respectively range in 1..10
and 2..10; C is tested against all 229 examples in the training set. For a given
m and n, C is randomly generated from m bond literals $bond(X_i, X_j)$, where
$X_i$ and $X_j$ are each selected among n variables in such a way that
$X_i \neq X_j$ and C is connected.

Results obtained with Django show the presence of a phase transition when the
number of literals and variables in the hypothesis are around 4 and 5
respectively, though this change in the covering probability is not coupled with
a complexity peak. The worst effective complexity is observed for hypotheses with
n literals and n + 1 variables (chains of atoms).

MCS obtains good results on the "artificial mutagenesis" θ-subsumption
problems. Since a single predicate symbol is actually considered, the
substitution graph is not relevant, and the maximal clique search efficiently
solves the search. On this problem, Django outperforms MCS by a gain factor
between 50 and 700.

⁵ This contrasts with the inefficiency observed for graph contexts [18], though
formally equivalent to signatures. However, this seems to be mostly due to
implementation matters: for the sake of generality, 1-neighborhoods are
implemented as lists of lists, whereas signatures are coded as boolean vectors.

5.3 Scope and Relevance of the Experimental Study 

Artificial problems considered in the paper differ from real-world θ-subsumption
problems encountered in ILP in three respects.

In the general case, a major issue is to decompose the problem at hand into
loosely related or unrelated subproblems [?] (e.g. decomposing the hypothesis into
k-local components [8]), as successful decomposition entails exponential savings
in the resolution cost. In this study, artificial problems are designed in such a
way that they are not decomposable into two or more disjoint CSPs [3]. The
θ-subsumption average cost reported for a given m-literal clause and L-constant
example thus corresponds to a pessimistic (non-decomposable case) estimate.

A second issue regards the uniform distribution of the θ-subsumption problems
considered. Each predicate symbol occurs once in the hypothesis, and
N = 100 times in the example. In real-world problems, some predicate symbols occur
more frequently than others in the examples. With respect to the dual CSP, this
means that constrained variables have domains with diverse sizes. Such a diversity
makes CSP heuristics, e.g. constraint propagation or dynamic variable ordering,
more effective. In this respect, considering predicate symbols with the same
number of literals built on them leads to a pessimistic estimate of the average
θ-subsumption cost.

The third issue concerns the arity of the predicate symbols, which is restricted
to 2 in our artificial setting. With respect to worst-case complexity, the
predicate arity does not affect the dual CSP; the dual CSP size is $O(N^m)$, the
size of dual domains exponentiated by the number of dual variables, which does
not depend on the arity. But the predicate arity dictates the number of dual
constraints. Assume that a (primal) variable in hypothesis C occurs in o distinct
literals in C; this is accounted for in the dual CSP by $o(o-1)/2$ dual
constraints⁶.

Assuming that all predicates are k-ary, their m x k arguments are selected
among n primal variables. Assuming this selection is uniform (which it is not, as
one has to ensure the clause connectivity), each variable intervenes on average
in $m \times k / n$ literals in C, and thus the total number of dual constraints
is $O(m^2 k^2 / n)$.

According to this preliminary analysis, increasing the predicate arity by a factor
$\sqrt{t}$ can be likened to decreasing the number of variables by a factor t.
Experimentally, decreasing the number of variables causes the phase transition to
move toward shorter hypotheses, everything else being equal (left region in
Fig. 1), with exponential decrease of the complexity peak [3].

⁶ Note that if a primal variable occurs twice in a single literal p in C, this
amounts to a dual unary constraint on the dual constrained variable $Y_p$,
directly accounted for by reducing the associated dual domain.

In summary, the artificial θ-subsumption problems considered were meant to
study the worst average case with respect to decomposability and distributional
diversity.

6 Conclusion and Perspectives 

This paper presents a new θ-subsumption algorithm, Django, operating on a 
constraint satisfaction-like representation of θ-subsumption. Django combines 
well-known CS heuristics (arc consistency, forward checking and dynamic variable 
ordering) with θ-subsumption-specific heuristics (signatures). Intensive 
experimental validation on artificial worst average instances shows that Django 
outperforms previous θ-subsumption algorithms |KTTR] by several orders of magnitude in 
computational cost. 

This computational gain is good news, as ILP systems routinely 
perform thousands of subsumption tests. 

Even more interesting is the fact that the θ-subsumption complexity gives 
indications regarding the current situation of the ILP search, as located in the 
YES, PT or NO regions after the CSP framework [7]. 

This might open several perspectives to ILP. 

On one hand, the relevance of the YES, NO and PT regions might be questioned 
in regard to real-world examples, whose distribution model could be arbitrar- 
ily different from the uniform model used in the artificial problems. How to 
characterize and exploit a generative model in order to refine and simplify the 
representation of an ILP problem, is investigated in the field of reformulation 
and abstraction (see m among others). 

On the other hand, it appears reasonable that, unless the target concept 
belongs to the YES region, relevant hypotheses lie in the PT region. This is due 
to the fact that most ILP learners prefer most general hypotheses provided that 
they are sufficiently correct (Occam’s Razor); therefore, no learner will engage in 
the NO region. In this perspective, new refinement operators directly searching 
the PT region would be most appreciated. 

Further research is first concerned with improving Django, checking whether 
other CS heuristics such as path-consistency are appropriate to θ-subsumption. 
In the same spirit, the CSP translation proposed for θ-subsumption will be 
extended to θ-reduction. The idea is that matching a clause with itself might 
give some information about the redundant literals. 

Another perspective is to use Django to compare alternative representations 
for an ILP problem, and select the representation with minimal θ-subsumption 
cost for randomly generated hypotheses. 

Last, an interesting question is whether and how the partial results of Django 
(values or variable links leading to most failures) can be used to navigate in the 
PT region, by repairing a clause into a clause with same complexity. 
Abstract. Considering the difficulties inherent in the manual construction 
of natural language parsers, we have designed and implemented 
the system Grind, which is capable of learning a sequence of context-dependent 
parsing actions from an arbitrary corpus containing labelled 
parse trees. To achieve this, Grind combines two established methods 
of machine learning: transformation-based learning (TBL) and inductive 
logic programming (ILP). Trained and tested on the SUSANNE corpus, 
Grind reaches an accuracy of 96 % and a recall of 68 %. 

Keywords: grammar induction, corpus-based parser construction, trans- 
formation-based learning, inductive logic programming 



1 Introduction 

Automated natural language understanding is an attractive goal 
that has become an important area of research in recent decades. One of the 
prerequisites for its solution is a machine system, called a parser, 
capable of translating natural language inputs into an internal representation 
suitable for subsequent computer manipulation. Although this translation is 
performed mainly on the level of syntax, practical experience reveals that the 
manual construction of a natural language grammar (and consequently the manual 
development of a parser as well) is a very time-consuming and error-prone task. 
To cover the complexity of natural language syntax with its enormous number 
of irregularities, the grammar is usually amended with new 
rules ‘ad hoc’. 

As a result of that practice, it is very difficult - even for trained 
computational linguists - to preserve the consistency and adequacy of the grammar in 
the course of development and maintenance. Really significant improvements 
of the grammar, especially those involving conceptual changes, tend to be 
postponed because they are laborious and their impact on the grammar’s 
performance is uncertain. In short, there is a need for a system which would bring some level 
of automation to the process of natural language grammar (and parser) 
construction. 

* This research has been partially supported by the Czech Ministry of Education under 
the grant JD MSM 143300003. 
The empirical alternative replaces hand-generated grammar rules with models 
obtained automatically by training over language corpora. Corpus-based 
methods for natural language syntax acquisition are generally referred to as 
grammar induction. A variety of approaches to grammar induction has been 
presented; they vary primarily in (1) the form of required inputs 
and (2) the form of expected outputs. 

1. required inputs: This criterion concerns the level of annotation present in 
training corpora. Although some systems have used raw (completely unanno- 
tated) text for grammar induction, those employing some kind of annotation 
(morphological or even syntactical) have produced more accurate parsers. 

2. expected outputs: This criterion concerns the level of analysis which the 
target system is expected to carry out. It also matters whether the outcome 
of the learning process should be a generative grammar (a system able to 
generate new sentences) or only a parser (a system designed just to analyse 
given sentences). 

In this contribution, we present a method for fully automated parser construc- 
tion from a treebank. It means that our system’s required input is a treebank (a 
corpus of syntactically annotated sentences) and its expected output is a parser 
(a tool for analysing constituent structure of given sentences). 

The paper is organised as follows: Section 2 provides a brief outline of our 
system Grind, Sections 3 and 4 detail the learning techniques used, Section 5 
presents the experiments and their comparison with relevant works, and, 
finally, the last section summarises our results, highlights the main advantages of our 
approach and mentions some intended improvements. 

2 Outline of the System Grind 

Our approach to the task of automated parser construction has resulted in the 
design and implementation of the system Grind (Grammar Induction) which is 
capable of learning a sequence of context-dependent parsing actions from a given 
corpus containing labelled parse trees. To achieve this, Grind combines two 
established methods of machine learning: transformation-based learning (TBL) 
and inductive logic programming (ILP). 

During the TBL phase, a sequence of deepening operators (as we call them) 
is induced. A deepening operator is a pair (Sub, sym) consisting of a string of 
symbols Sub and a symbol sym. Each deepening operator is designed to recognise 
a particular constituent (e.g. a phrase, a clause, etc.) within a given sentence. 
Note that the deepening operators themselves work regardless 
of the context. That is, if (Sub, sym) is the current deepening operator 
being applied and if there is the substring Sub in the current state of analysis, 
then Sub is considered as a new constituent which is consequently labelled with 
the symbol sym. Since Sub need not actually represent a constituent in every 
context, or Sub might sometimes represent a constituent with a different label, 
the uncontrolled application of deepening operators is very likely to produce 
inadequate analyses. 
Provided the training corpus contains a complete syntactic analysis for each 
sentence, Grind is able to generate and distinguish automatically the parsing 
configurations in which a substring Sub should, or should not, be recognised as 
a constituent and labelled with a symbol sym. 

Thus, the ILP learning phase can exploit these positive and negative examples 
of the correct and incorrect application of various deepening operators to induce 
a set of forbidding predicates (as we call them). A forbidding predicate has 
four arguments describing a particular parsing configuration (a left context, 
a questioned substring, a right context, and a suggested label) and decides 
whether, in the given context, the questioned substring really represents 
a constituent with the suggested label. So, in effect, the forbidding predicates 
make the deepening operators context-dependent. For the task of inducing the 
forbidding predicates, we have tried two ILP systems: Aleph¹ and Tilde². 

In the operating mode, when parsing a fresh text, Grind takes tag sequences 
rather than strings of words as the input. This means that the text needs to be 
provided with unambiguous part-of-speech tagging before it is submitted to Grind. 

3 TBL Phase 

3.1 Application of Deepening Operators 

The transformation-based learning phase takes a set of training sentences 
accompanied by their correct parse trees as the input. At each learning step, every 
training sentence is associated with an actual list of parse trees (we call it a 
treelist, but an alternative term would be an ordered forest). In the beginning, each 
treelist consists of elementary trees only. An elementary tree has just one 
leaf (represented by a particular word) and a root labelled with a part-of-speech 
tag for this word. Thus, initially, each treelist represents an unambiguously 
morphologically tagged sentence. Then, the deepening operators gradually transform 
all treelists in such a manner as to make those structures conform more closely to 
the correct parse trees present in the training set. Finally, each treelist consists of 
just one parse tree which is identical with the corresponding correct parse tree. 
As the output, the TBL phase produces a sequence of those deepening operators 
which were used for transforming all the initial treelists into the final ones. 

To make this idea clear, we show a demonstration run of the TBL phase. 
Our toy training set contains just one sentence with a parse tree taken from the 
corpus SUSANNE [8]. The parse tree T is displayed in Figure 1. 

Figure 2 shows the initial treelist $T_0$ associated with the training sentence 
as well as the gradually transformed treelists $T_1$, $T_2$ and $T_3$. In each treelist $T_i$, 
the overbraces indicate all currently applicable deepening operators³. What we 
obtain after their application is the treelist $T_{i+1}$ ($T_4$ is not shown here, but it 
corresponds to T). 

¹ http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph 
² http://www.cs.kuleuven.ac.be/~hendrik/Tilde/tilde.html 
³ Applicable deepening operators are always determined by innermost parentheses. 
Fig. 1. The correct parse tree T for the sentence “The opposing aircraft continued to 
come on.” 



The resulting sequence of deepening operators attained during our 
demonstration TBL phase is shown in Figure 3. The sequence comprises all applied 
operators in the order of their application. 

When parsing an unseen sentence in the operating mode, each member of 
the learned sequence is taken just once⁴ and applied to the actual treelist 
wherever possible. For example, the deepening operator (‘AT Tg NNc’, N) would 
take effect on all occurrences of the substring of roots ‘AT Tg NNc’. We 
should emphasise that our TBL setting is rather special since all 
transformations performed by deepening operators are irreversible. Therefore, it is not 
the case that new transformation rules (i.e. deepening operators) added to the 
sequence can correct errors left by the previous ones. For this reason, the 
application of deepening operators must be controlled by forbidding predicates (see 
Section 4). 

Evidently, Grind employs a bottom-up parsing scheme and its deepening 
operators resemble context-free grammar rules. However, their operational 
semantics is different from shift-reduce parsing: deepening operators work on treelists 
instead of a stack, and there is just one applicable operator at each parsing 
configuration, so no backtracking is needed. 

Although the set of deepening operators is simply given by the training data, 
we need to determine their appropriate order so as to maximise the accuracy of 
resulting analyses. 



⁴ It means that some deepening operators can possibly have multiple occurrences in 
the sequence. 

⁵ Note that we mean the operational mode in which the sequence of operators is 
already learned. 
Fig. 2. Gradual transformation of the initial treelist conducted by application of 
deepening operators. 

T_0 → ('VVGt', Vg), ('VVDv', Vd), ('TO VVOi', Vi), ('RP', R) → T_1 
T_1 → ('Vg', Tg), ('Vi R', Ti) → T_2 
T_2 → ('AT Tg NNc', N) → T_3 
T_3 → ('N Vd Ti', S) → T_4 

Fig. 3. The resulting sequence of deepening operators. For the sake of clarity, they are 
grouped with respect to the transitions between particular treelists. 
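
The demonstration run can be replayed with a few lines of Python. This is only an illustrative sketch, not the Grind implementation: trees are assumed to be `(label, children)` pairs, words are omitted so that the elementary trees carry tags only, and the initial tag sequence is inferred from the operators listed in Figure 3. The function names `apply_operator` and `replay` are hypothetical.

```python
# Illustrative sketch (not the authors' code): replaying a learned sequence
# of deepening operators on a treelist. A tree is a (label, children) pair.

def apply_operator(treelist, sub, sym):
    """Apply the deepening operator (sub, sym) wherever possible.

    Every left-to-right, non-overlapping run of roots matching `sub`
    is replaced by a new tree labelled `sym` with those trees as children.
    """
    pattern = sub.split()
    result, i = [], 0
    while i < len(treelist):
        window = treelist[i:i + len(pattern)]
        if [label for label, _ in window] == pattern:
            result.append((sym, window))   # new constituent labelled sym
            i += len(pattern)
        else:
            result.append(treelist[i])     # tree left untouched
            i += 1
    return result


def replay(treelist, operators):
    """Take each operator of the sequence once, in order."""
    for sub, sym in operators:
        treelist = apply_operator(treelist, sub, sym)
    return treelist


# Assumed tag sequence for "The opposing aircraft continued to come on."
T0 = [(tag, []) for tag in "AT VVGt NNc VVDv TO VVOi RP".split()]

# The operator sequence of Figure 3.
OPERATORS = [
    ("VVGt", "Vg"), ("VVDv", "Vd"), ("TO VVOi", "Vi"), ("RP", "R"),
    ("Vg", "Tg"), ("Vi R", "Ti"),
    ("AT Tg NNc", "N"),
    ("N Vd Ti", "S"),
]

T4 = replay(T0, OPERATORS)
print([label for label, _ in T4])  # roots of the final treelist
```

Because no context is consulted, the same loop would also apply an operator in configurations where it is inappropriate, which is the situation the forbidding predicates are meant to prevent.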



3.2 Evaluation Function 

At each learning step there are generally many applicable deepening operators. 
Therefore, Grind makes use of an evaluation function to decide which operator 
performs best in transforming treelists into ones which are closer to the 
corresponding correct parse trees. The best scoring operator is then selected, 
in accordance with the hill-climbing search strategy, and 
becomes the next deepening operator in the sequence of operators being learned. 
This procedure is repeated until all treelists are transformed into 
their final state (which corresponds to the correct parse tree). 
In order to define the evaluation function, we assign one of four possible 
categories (1 = exact match, 2 = partial match, 3 = redundant node, or 4 = 
crossing brackets) to each node from the current treelist $T_1$ and similarly we 
assign one of two possible categories (5 = recognised node or 6 = unrecognised 
node) to each node from the correct parse tree $T_2$. 

A node $n_e \in T_1$ represents an exact match if there is a corresponding node 
$n'_e \in T_2$ which has exactly the same set of leaves in its subtree and also 
the same label. A node $n_p \in T_1$ represents a partial match if there is a 
corresponding node $n'_p \in T_2$ with the same set of leaves, but the labels of nodes 
$n_p$ and $n'_p$ differ. A node $n_r \in T_1$ is called redundant if it does not violate the 
phrase nesting, but there is no corresponding node in $T_2$ spanning the same 
portion of the given sentence. All the remaining nodes $n_c \in T_1$ which cannot be 
classified by the previous categories give rise to crossing brackets. On the other hand, 
a node $m_r \in T_2$ was recognised if there is a node $m'_r \in T_1$ anchoring a subtree 
with the same set of leaves. (Node labels do not matter in this case.) If the node 
$m_u \in T_2$ was not recognised, then we add it to the category unrecognised. 
Every occurrence of an exact match or a recognised node is favourable whereas 
the other categories decrease the accuracy of the obtained parse tree. 
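
The six categories can be sketched in Python as follows. This is an illustration under simplifying assumptions rather than the authors' code: each node is represented as a `(label, (start, end))` pair, where the half-open span stands for the set of leaves it covers, and the function names are hypothetical.

```python
# Illustrative sketch: assigning the six node categories.
# A node is (label, (start, end)); the half-open span [start, end)
# stands for the set of leaves covered by the node.

def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def classify_obtained(obtained, correct):
    """Categories 1-4 for the nodes of the current treelist."""
    counts = {"exact": 0, "partial": 0, "redundant": 0, "crossing": 0}
    for label, span in obtained:
        labels_on_span = [l for l, s in correct if s == span]
        if label in labels_on_span:
            counts["exact"] += 1       # same leaves, same label
        elif labels_on_span:
            counts["partial"] += 1     # same leaves, different label
        elif any(crosses(span, s) for _, s in correct):
            counts["crossing"] += 1    # violates the phrase nesting
        else:
            counts["redundant"] += 1   # harmless but superfluous
    return counts


def classify_correct(obtained, correct):
    """Categories 5-6 for the nodes of the correct parse tree."""
    spans = {s for _, s in obtained}
    recognised = sum(1 for _, s in correct if s in spans)
    return {"recognised": recognised,
            "unrecognised": len(correct) - recognised}


# A made-up example over a seven-token sentence.
correct = [("S", (0, 7)), ("N", (0, 3)), ("Tg", (1, 2)),
           ("Vd", (3, 4)), ("Ti", (4, 7))]
obtained = [("S", (0, 7)), ("Ns", (0, 3)), ("X", (2, 4)), ("Y", (4, 6))]

print(classify_obtained(obtained, correct))
print(classify_correct(obtained, correct))
```

In the example, the node `X` over tokens 2-3 overlaps the correct constituent `N` without being nested in it, so it counts as a crossing bracket, whereas `Y` lies entirely inside `Ti` and is merely redundant.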

The value of the deepening operator Oper = (Sub, sym) is determined as 
follows. For each treelist containing the substring of roots Sub, the count of 
nodes in each category would increase (possibly by zero) if the operator Oper 
were applied there⁶. Grind computes the difference in the count of nodes for each 
category and adds up these numbers over all treelists. Thus, let $D_i$ be the total 
difference for the category $i$. Further, each category has its own weight (a 
real number). If a weight $w_i$ is assigned to the category $i$, then the value $V(Oper)$ 
of the operator Oper can be expressed by the formula $V(Oper) = \sum_{i=1}^{6} D_i w_i$. 

In the current implementation, the weights are preset a priori to fixed values 
according to the relevance of the corresponding category. The favourable categories 
have positive weights whereas the others have negative ones. 
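
As a worked illustration of the formula, the following Python sketch computes V(Oper) from per-treelist category counts taken before and after a hypothetical application of the operator. The weight values are invented for the example (the paper does not state the actual ones); only their signs follow the text. The names `WEIGHTS` and `operator_value` are assumptions.

```python
# Illustrative sketch: V(Oper) = sum over the six categories of D_i * w_i.
# The weights below are hypothetical; only their signs follow the text
# (favourable categories positive, all others negative).

CATEGORIES = ("exact", "partial", "redundant", "crossing",
              "recognised", "unrecognised")

WEIGHTS = {"exact": 1.0, "partial": -0.5, "redundant": -0.25,
           "crossing": -1.0, "recognised": 1.0, "unrecognised": -1.0}


def operator_value(before, after, weights=WEIGHTS):
    """Value of an operator given category counts per treelist.

    `before` and `after` are parallel lists with one dict of category
    counts per treelist in which the operator is applicable.
    """
    value = 0.0
    for category in CATEGORIES:
        # D_i: total difference for this category over all treelists
        d = sum(a[category] - b[category] for b, a in zip(before, after))
        value += d * weights[category]
    return value


# One treelist: the operator creates one exactly matching node, so one
# node of the correct tree moves from unrecognised to recognised.
before = [dict(exact=0, partial=0, redundant=0, crossing=0,
               recognised=0, unrecognised=5)]
after = [dict(exact=1, partial=0, redundant=0, crossing=0,
              recognised=1, unrecognised=4)]

print(operator_value(before, after))
```

With such a function, the hill-climbing step amounts to evaluating every applicable operator and appending the one with the highest value to the learned sequence.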

4 ILP Phase 

4.1 Meaning of Forbidding Predicates 

Each positive (negative) example generated during the previous TBL phase 
describes an instance of an incorrect (a correct)⁷ application of a particular 
deepening operator. Both positive and negative examples have the form 
forbid(Tag, Lctx, Phrase, Rctx), where Tag is a suggested label for the new 
node which is to be constructed, Phrase is a treelist whose elements (subtrees) 
would be attached to the new node so as to form a new subtree, and Lctx and 
Rctx are treelists representing the left and the right contexts, respectively. The 
head of every learned rule has the same form as the learning examples. 

⁶ The counts of recognised and unrecognised nodes can even decrease. 

⁷ Indeed, the positive examples represent inappropriate configurations for the given 
operator (and vice versa) because the forbidding predicate is intended to succeed in 
the case when the particular application should be discarded. 
As mentioned above, forbidding predicates are used to control the 
application of deepening operators in the operating mode. For example, if the 
deepening operator (‘NNlc P’, Ns) is to be applied on a particular treelist, then the 
substring of roots ‘NNlc P’ can have multiple occurrences there. Every such 
occurrence represents a ground instantiation of the goal 

?- forbid('Ns', Lctx, [tree('NNlc', LN1), tree('P', LN2)], Rctx). 

where the variables LN1 and LN2 are bound to the lists of subtrees whose roots 
are children of NNlc and P, respectively. 

If this ground goal succeeds - which depends on the learned theory - the 
corresponding deepening operator is discarded in the given context. 

4.2 Background Knowledge 

Our background knowledge consists of predicates which are designed to explore 
the structure of given treelists. Hereafter, we follow the notation used in 
Aleph: input variables are preceded by the sign ‘+’ and constants (ground 
terms) are marked by the sign ‘#’. Assuming that the mode and type declaration 
of the target predicate is 

forbid(+tag, +treelist, +treelist, +treelist), 

we use the following background predicates: 

— tag(+tag,#tag) simply checks the suggested tag of proposed node. 

— roots(+treelist,#scope,#pattern) succeeds, if in the given treelist there 
is a sublist of roots which matches the pattern of symbols. The pattern may 
include wildcards. Moreover, the search within pattern is narrowed by the 
term scope. 

— leaves(+treelist,#scope,#pattern) acts similarly to the previous predicate, 
but it tries to match the given pattern against a sublist of leaves (rather 
than roots) from the given treelist. The term scope has an analogous meaning 
here. 

— path(+treelist,#scope,#pattern) succeeds, if there is a path evolving 
from some root in the given treelist and the symbols from this path match 
the pattern. 

— node_succ(+treelist,#scope,#tag,#pattern) looks for a node (some- 
where in the given treelist) which is labelled with the tag and the list of 
its successors contains a sublist matching the pattern. 

— empty_ctx(+treelist) succeeds only when the given treelist is empty. 

To make the idea of our background knowledge more transparent, we present 
a small demonstration. For example, the induced rule in Figure 4 succeeds if 
all three of the following conditions hold: the first tree on the left has the 
leaf AT, somewhere within the sublist Phrase there is a tree with the root P, 
and the right context is empty. As demonstrated on the treelist in Figure 4, 
forbid(Tag, Lctx, Phrase, Rctx) :- 
    leaves(Lctx, first(1), ['AT']), 
    roots(Phrase, somewhere, ['P']), 
    empty_ctx(Rctx). 

[Figure: the treelist with roots Vg, AT, NNlc and P over the words “closing the door behind him”; the overbrace marks the substring ‘NNlc P’ proposed as a constituent Ns.] 

Fig. 4. An example of a forbidding rule and a configuration in which this rule succeeds, 
thus discarding the application of deepening operator (‘NNlc P’, Ns). 



Tag    = 'Ns' 
Lctx   = [tree('AT', []), 
          tree('Vg', [tree('VVGv', [])])] 
Phrase = [tree('NNlc', []), 
          tree('P', [tree('II', []), tree( 
Rctx   = [] 

Fig. 5. A full instantiation of the goal forbid(Tag, Lctx, Phrase, Rctx) in the 
configuration displayed above. Note that the treelist Lctx is reversed and the forbidding 
rule considers pre-terminals, instead of word forms, as leaves. 



in this case the forbidding rule discards the application of deepening operator 
(‘NNlc P’, Ns). Figure 5 explicitly shows the corresponding instantiation of the 
goal forbid(Tag, Lctx, Phrase, Rctx). 
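
To illustrate how such a rule is evaluated, here is a small Python sketch of three of the background predicates and of the rule from the demonstration. It is an illustration only: trees are assumed to be `(label, children)` pairs, the readings of the scope terms (`first(k)` as "the first k trees", `somewhere` as "the whole treelist") are assumptions based on the descriptions above, wildcards are not handled, and the second subtree of P, which is truncated in the instantiation shown in the source, is left out.

```python
# Illustrative sketch: evaluating the forbidding rule of the demonstration.
# Assumptions: a tree is (label, children); scope ("first", k) restricts
# the search to the first k trees, "somewhere" to the whole treelist.

def tree_leaves(tree):
    """Pre-terminal labels at the frontier of a tree, left to right."""
    label, children = tree
    if not children:
        return [label]
    return [leaf for child in children for leaf in tree_leaves(child)]


def contains(sequence, pattern):
    """True if `pattern` occurs as a contiguous sublist of `sequence`."""
    n = len(pattern)
    return any(sequence[i:i + n] == pattern
               for i in range(len(sequence) - n + 1))


def in_scope(treelist, scope):
    if scope == "somewhere":
        return treelist
    kind, k = scope              # only ("first", k) is sketched here
    if kind != "first":
        raise ValueError("unsupported scope: %r" % (scope,))
    return treelist[:k]


def roots(treelist, scope, pattern):
    return contains([label for label, _ in in_scope(treelist, scope)],
                    pattern)


def leaves(treelist, scope, pattern):
    frontier = [leaf for tree in in_scope(treelist, scope)
                for leaf in tree_leaves(tree)]
    return contains(frontier, pattern)


def empty_ctx(treelist):
    return not treelist


def forbid(tag, lctx, phrase, rctx):
    """The example rule: AT first on the left, a root P in Phrase,
    and an empty right context."""
    return (leaves(lctx, ("first", 1), ["AT"])
            and roots(phrase, "somewhere", ["P"])
            and empty_ctx(rctx))


# The instantiation of the demonstration (Lctx is reversed).
lctx = [("AT", []), ("Vg", [("VVGv", [])])]
phrase = [("NNlc", []), ("P", [("II", [])])]   # second child of P omitted
rctx = []

print(forbid("Ns", lctx, phrase, rctx))
```

The rule succeeds on this configuration, so the application of the operator would be discarded; with a non-empty right context the same goal fails and the operator is applied.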

From the viewpoint of ILP, the main asset of the TBL phase is that 
the parser construction problem is reduced to a classification task. 
Intuitively, a direct induction of deepening operators by means of ILP alone 
would imply higher computational requirements. 

Two ILP systems were tested for the task of inducing the forbidding 
predicates: Aleph and Tilde. Aleph is an ILP system that supersedes P-Progol 
and its inductive algorithm is based on a technique called mode-directed inverse 
entailment (?]. Aleph employs a top-down search and a covering strategy to propose 
hypotheses (sets of clauses) from examples. Tilde is an ILP system that learns 
hypotheses in the form of first-order logical decision trees [T]. The decision trees, 
induced in a top-down manner from a set of pre-classified examples, can be used 
to classify unseen examples. Unlike Aleph, Tilde follows a divide-and-conquer 
strategy. 

The reason for choosing these two particular systems was that we did not 
have to develop two completely different implementations of the background 
knowledge predicates; the necessary modifications were mainly of a syntactic nature. 
Moreover, both systems can introduce constants into the body of a clause, 
which is an especially useful feature for our purpose. 
5 Experiments and Results 

5.1 Training and Testing Data 

The SUSANNE corpus [8] comprises a subset of the Brown Corpus of American 
English annotated in accordance with the SUSANNE scheme. We randomly 
chose 500 training and 500 testing trees from this corpus. The training set and 
the test set were disjoint. Sentences from the test set were from 2 to 20 words 
long and the average length was 11.8. The words within sentences were replaced 
with their corresponding tags. 

5.2 Evaluation Criteria 

In order to measure the performance of our system, we have used evaluation 
criteria according to the Grammar Evaluation Interest Group scheme [4]: overall 
accuracy and recall. If $C$ (Correct) is the number of those nodes in the obtained 
parse tree which do not give rise to crossing brackets and $T_O$ (Total Obtained) is 
the total number of nodes in this tree, then the overall accuracy is the average 
fraction $C/T_O$ of non-crossing nodes per test sentence. Similarly, if 
$R$ (Recognised) is the number of recognised nodes in the correct parse tree and 
$T_C$ (Total Correct) is the total number of nodes in this tree, then the recall is the 
average fraction $R/T_C$ of recognised nodes per test sentence. 
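
The two criteria can be sketched in Python as follows. This is an illustrative reading of the definitions above, not the evaluation code used in the experiments: each tree is reduced to the set of half-open `(start, end)` spans of its nodes, so labels and unary chains are ignored, and the function names are hypothetical.

```python
# Illustrative sketch: overall accuracy C/T_O and recall R/T_C.
# Each tree is reduced to the set of (start, end) spans of its nodes.

def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def sentence_scores(obtained, correct):
    """Return (C/T_O, R/T_C) for a single test sentence."""
    # C: obtained nodes that cross no node of the correct tree
    c = sum(1 for s in obtained
            if not any(crosses(s, t) for t in correct))
    # R: correct nodes whose span also occurs in the obtained tree
    r = sum(1 for t in correct if t in obtained)
    return c / len(obtained), r / len(correct)


def corpus_scores(pairs):
    """Average the per-sentence fractions over all test sentences."""
    scores = [sentence_scores(o, c) for o, c in pairs]
    n = len(scores)
    return (sum(a for a, _ in scores) / n,
            sum(r for _, r in scores) / n)


# A made-up sentence with seven tokens.
obtained = {(0, 7), (0, 3), (2, 4), (4, 6)}
correct = {(0, 7), (0, 3), (1, 2), (3, 4), (4, 7)}

print(sentence_scores(obtained, correct))
```

In the example, three of the four obtained nodes are free of crossing brackets and two of the five correct nodes are recognised. Note that a very flat tree scores well on accuracy while scoring poorly on recall, which is why both figures are reported.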

5.3 Accuracy and Recall 

At first, we measured the overall accuracy and recall of Grind in relation to the 
cardinality of the training set. To achieve this, we selected gradually larger 
subsets of the original training set, starting with 50 trees and adding a further 
50 trees at each step, so that a smaller training set was always included 
in a larger one. Performance parameters were measured on the same test set in 
each case. Figures 6 and 7 show learning curves for overall accuracy and recall, 
respectively. For this part of the evaluation, we present only the results achieved 
with Aleph since the performance of Tilde was very similar. In both figures, the 
solid line with crosses represents the performance of the system without the ILP 
learning phase, that is, with no forbidding predicates used. The dash-dot line 
with circles represents the performance achieved when the application of 
deepening operators was guarded by forbidding predicates. 

Another important criterion regarding accuracy is the ratio of sentences from 
the test set which were analysed with no crossing brackets or with, at most, a 
small number of them. As for the recall, we observed the ratio of sentences which 
were analysed with all constituents recognised or with, at most, a small number 
of them missing. Tables 1 and 2 show these performance parameters obtained 
with the training set of 500 trees. 

Finally, we carried out an experiment concerning the richness of the 
tagset used. As mentioned above, Grind parses tag sequences. This means that, given 
Fig. 6. Learning curves for overall accuracy. At first, the accuracy decreases due to the 
increase of recall: more nodes are recognised, thus more of them give rise to crossing 
brackets. However, from the count of 300 training trees and more, the accuracy with 
ILP (Aleph) rises up to 95.9 %, while the curve for accuracy without ILP still falls. 

Fig. 7. Learning curves for recall. Initially, with the small number of training trees the 
recall grows rapidly, but then its growth slowly diminishes. We can see that application 
of ILP (Aleph) always slightly increases the recall, so that the curve for recall with 
ILP continually rises up to 68.2 %. 



an input sentence, some preliminary work outside Grind must be done to 
assign a part-of-speech tag to each individual word. The more ambiguities need 
to be resolved, the more demanding this preliminary work may be. So 
the ideal situation would be if there were just one possible tag for each word. 
On the other hand, a richer tagging, if disambiguated, would convey more 
syntactic information useful for the parsing task. Bearing this fact in mind, we 
Table 1. Accuracy in terms of percentage of those trees which had only a small number 
of crossing brackets. The training set contained 500 trees. 

Number of            Percentage of corresponding sentences 
crossing brackets    Aleph      Tilde 
= 0                  73.8 %     70.8 % 
≤ 1                  81.8 %     85.5 % 
≤ 2                  94.5 %     94.4 % 
≤ 3                  97.2 %     98.3 % 
≤ 4                  99.3 %     99.3 % 
≤ 5                  99.8 %     99.9 % 



Table 2. Recall in terms of percentage of those trees which had only a small number 
of missing nodes. The training set contained 500 trees. 

Number of            Percentage of corresponding sentences 
missing nodes        Aleph      Tilde 
= 0                  16.2 %     16.2 % 
≤ 1                  30.5 %     30.9 % 
≤ 2                  46.5 %     47.8 % 
≤ 3                  62.2 %     64.5 % 
≤ 4                  75.5 %     78.2 % 
≤ 5                  85.7 %     87.7 % 



tried to prune the original SUSANNE tagset to see how Grind would perform 
if it were provided with less information due to a smaller tagset with a lower 
number of possible tags per word. 

Initially, the training trees together with the test sentences contained 477 
distinct tags. Table 3 shows complete results which we have achieved on 500 training 
trees when using this original tagset. To prune it, all tags were gradually 
truncated in two steps. In the first step, the maximum length of a tag was set to 
three characters and all longer tags were shortened to this length. In this 
way, we obtained a tagset which had only 316 distinct tags (for the corresponding 
results see Table 4). In the second step, all tags were further shortened to the 
maximum length of two characters. The tagset then shrank to 130 distinct tags 
(the resulting performance parameters are in Table 5). 
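
The pruning step itself is simple enough to sketch in Python. The sample below uses only tags that occur in this paper's examples (it is not the full SUSANNE tagset), and the function name `truncate_tagset` is hypothetical.

```python
# Illustrative sketch: pruning a tagset by truncating every tag to a
# maximum length and counting the distinct tags that remain.

def truncate_tagset(tags, max_len):
    """Return the set of tags after shortening each to max_len characters."""
    return {tag[:max_len] for tag in tags}


# A small sample of tags appearing in the examples of this paper.
sample = {"AT", "NNc", "NNlc", "VVGt", "VVGv", "VVDv", "VVOi", "TO", "RP"}

three = truncate_tagset(sample, 3)
two = truncate_tagset(sample, 2)

print(len(sample), len(three), len(two))
```

On this sample, truncation to three characters already merges VVGt and VVGv, and truncation to two characters collapses all noun tags into NN and all verb tags into VV, which illustrates the kind of information loss the experiment is designed to probe.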

Tables 3, 4 and 5 should demonstrate that the application of ILP systems 
managed to compensate for the partial loss of information in the training data. 
Without ILP, the important parameters (namely the overall accuracy) got noticeably 
worse, whereas after the employment of ILP the performance stayed at a 
reasonable level. Tables 3, 4 and 5 also present the ratio of exact matches, partial 
matches and redundant nodes. Using the pruned tagset, Grind starts to make 
Table 3. Comparison of the results achieved on 500 training trees, using the original 
tagset which contained 477 distinct tags. 

Original tagset - 477 distinct tags 
                     without ILP    with Aleph    with Tilde 
Overall Accuracy     92.9 %         95.9 %        94.5 % 
Recall               67.0 %         68.2 %        69.1 % 
Exact Match          55.2 %         69.4 %        68.2 % 
Partial Match         1.3 %          1.2 %         1.1 % 
Redundant Nodes      36.4 %         25.3 %        25.2 % 
Crossing Brackets     7.1 %          4.1 %         5.5 % 



Table 4. The results achieved on 500 training trees, using a tagset whose tags were 
truncated to 3 characters - it yielded 316 distinct tags. 

All tags truncated to 3 characters - 316 distinct tags 
                     without ILP    with Aleph    with Tilde 
Overall Accuracy     91.1 %         94.6 %        94.4 % 
Recall               66.5 %         68.7 %        67.6 % 
Exact Match          52.0 %         64.6 %        66.1 % 
Partial Match         5.0 %          6.2 %         6.0 % 
Redundant Nodes      34.1 %         23.8 %        22.1 % 
Crossing Brackets     8.9 %          5.4 %         5.6 % 



Table 5. The results achieved on 500 training trees, using a tagset whose tags were 
truncated to 2 characters - it yielded 130 distinct tags. 

All tags truncated to 2 characters - 130 distinct tags 
                     without ILP    with Aleph    with Tilde 
Overall Accuracy     88.6 %         93.1 %        92.6 % 
Recall               69.8 %         69.0 %        68.1 % 
Exact Match          36.4 %         48.4 %        49.0 % 
Partial Match        16.7 %         22.8 %        21.8 % 
Redundant Nodes      35.5 %         21.9 %        21.8 % 
Crossing Brackets    11.4 %          6.9 %         7.4 % 



more errors in determining the correct label for nodes (the exact match goes 
down in contrast to the partial match) . Nevertheless, the frequency of the most 
serious error (the crossing brackets) does not rise too much. 

From the tables and figures mentioned above we can see that forbidding 
predicates induced by ILP substantially improve the parsing accuracy of Grind, whereas 
the increase of recall after their application is not very high. The performance 
parameters of Aleph and Tilde are surprisingly similar. This is probably 
due to the fact that both systems used the same background knowledge. 
5.4 Comparison with Relevant Works 

Zelle and Mooney [Hj tested their system Chill [B] on a portion of the ATIS corpus. 
One of their reported experiments had a very similar setting to ours since Chill 
was adjusted to parse tag sequences. The authors also used similar performance 
criteria, but the terminology slightly differs: their consistent brackets accuracy 
refers to our overall accuracy and their zero crossing brackets accuracy refers 
to our percentage of sentences with no crossing brackets. After training on 525 
sentences, Chill constructed a parser which achieved 90 % of consistent 
brackets (Grind 95.9 %) and 64 % of zero crossing brackets (Grind 73.8 %). 
However, these figures should be handled with care, since the parser constructed by 
Chill was designed to produce binary-branching trees⁸ while Grind normally 
generates n-ary structures. It is easy to see that a tree containing a single, flat 
constituent covering the entire sentence always yields a perfect crossing score. 

Therefore we tried to compare the two systems also in terms of recall. Zelle
and Mooney used a partial match accuracy which does not agree with our definition
of partial match, but its basic idea is related to our notion of node pairing.
In their paper, two constituents are said to match if they span exactly the same
words in the sentence. If constituents match and have the same label, then they are
identical. The overlap between the obtained tree and the correct tree is computed
by trying to match each constituent of the obtained tree with a constituent in
the correct tree. If an identical constituent is found, the score is 1.0. A matching
constituent with an incorrect label scores 0.5. The sum of the scores for all
constituents is the overlap score $O$. The partial match accuracy for a sentence
is consequently computed as $(O/F + O/C)/2$, where $F$ and $C$ are the numbers of
constituents in the obtained tree and in the correct tree, respectively. We have
found that the partial match accuracy of Grind is 81.3%, which is fairly
comparable with Zelle and Mooney’s test (they report 84%).
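The overlap-based measure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: representing a constituent as a `(start, end, label)` triple and the name `partial_match_accuracy` are choices made here, not part of Chill or Grind, and both trees are assumed to be non-empty.

```python
def partial_match_accuracy(found, correct):
    """Average of O/F and O/C for one sentence.

    Each constituent is a hypothetical (start, end, label) triple;
    (start, end) identifies the words it spans.
    """
    # Index the correct constituents by the span of words they cover.
    correct_by_span = {}
    for start, end, label in correct:
        correct_by_span.setdefault((start, end), set()).add(label)

    overlap = 0.0
    for start, end, label in found:
        labels = correct_by_span.get((start, end))
        if labels is None:
            continue  # no constituent of the correct tree spans the same words
        # Identical constituent scores 1.0, matching span with wrong label 0.5.
        overlap += 1.0 if label in labels else 0.5

    return (overlap / len(found) + overlap / len(correct)) / 2


# Toy trees (invented for illustration, not taken from any corpus).
found = [(0, 5, "S"), (0, 2, "NP"), (2, 5, "NP")]
correct = [(0, 5, "S"), (0, 2, "NP"), (2, 5, "VP"), (3, 5, "NP")]
score = partial_match_accuracy(found, correct)
print(round(score, 3))
```

In this toy case two constituents are identical and one matches with a wrong label, so $O = 2.5$, $F = 3$ and $C = 4$.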

The performance of Grind can also be compared with the system of Brill [2].
He ran experiments on the Wall Street Journal corpus. After training on 500
sentences with an average sentence length of 10.8, his transformation-based parser
achieved a bracketing accuracy (which coincides with our overall accuracy) of
89.3% (Grind 95.9%). As reported in [2], the percentages of sentences in the
test set which had no crossing constituents, or one at most, or two at most,
were 53.7%, 72.3%, and 84.6%, respectively. The corresponding results for Grind
are 73.8%, 81.8%, and 94.5%. However, it should be noted that Brill’s parser
always produces a binary-branching tree.

Unfortunately, we cannot compare our results with the system Grids of
Langley [Q, although his work was a primary inspiration for us. The reason is
that he ran experiments on unrealistic toy data only. Furthermore, he does not
address the descriptive adequacy of the induced grammar and reports only results
concerning the probability of accepting a legal sentence and the probability of
generating a legal sentence, while our system is not designed to work as an
acceptor or a generator.



But if the parser runs into a dead-end while parsing a test sentence, then the returned 
tree needn’t be binary.

6 Conclusions and Future Work

We have presented a new approach to automated parser construction which
is based on a combination of two established techniques: transformation-based
learning and inductive logic programming. Our system Grind achieved a relatively
better accuracy rate than related systems, despite a slightly
lower recall. In this respect, Grind follows the view of many linguists who
claim that NLP systems should preferably “do less than make errors”. Moreover,
our parsing scheme is easy to understand and therefore it represents a
suitable starting point for further development of a system which would involve
human-machine co-operation during the process of natural language parser
construction.

In the future, we intend to improve Grind in the following respects: to replace the
hill-climbing strategy in the TBL phase with beam search, to devise a technique
for automatic adjustment of the weight values rather than keeping them fixed,
and to move the system towards human-machine co-operation.

Abstract. In line with previous work by S. Muggleton and C. Sakama,
we extend the logical characterization of inductive logic programming
to normal logic programs under the stable models semantics. A logic
program in this non-monotonic semantics can be contradictory or can
have one or several models. We provide a complete characterization of
the hypotheses that are solutions to induction for this kind of program.



1 Introduction 

Consider the following motivating example. 

Example 1. Given the normal logic program $B$

$p \leftarrow \mathit{not}\ q$
$q \leftarrow \mathit{not}\ p$

assume we want an extension of this program, $B \cup H$, such that the atoms of the set
$E = \{p\}$ are consequences of the extension, $B \cup H \models E$. Note that $B$ has two
stable models, $\{p\}$ and $\{q\}$. Thus no literal is a consequence of $B$. □

Consider the following solutions: 

$H_1 = \{p \leftarrow\}$, $H_2 = \{p \leftarrow q\}$, $H_3 = \{p \leftarrow \mathit{not}\ p\}$, $H_4 = \{p \leftarrow \mathit{not}\ p, q\}$.

Solution $H_1$ is directly a fact about the wanted atom and it can be induced
by current ILP methods. Solutions $H_3$ and $H_4$ contain rules with negation as
failure, thus only non-monotonic ILP could induce them.

But in fact $H_3$ and $H_4$ are not inducible using current NM-ILP methods for
the following reasons.

• The program $B$ has several stable models and, in particular, no atomic
consequences. Thus the enlarged bottom set of E-IE [S], and the expansion set
$M^+$ of NM-IE are empty. Then there is no set from which to get possible
candidates for body literals of the hypothesis $H$.

• The literal $p$ appears both in the body and in the head of $H$, which is not
allowed in these methods; furthermore, this makes $H_3$ alone inconsistent.

The most interesting case is solution $H_2$ because it is a positive rule, thus
current ILP methods might be expected to induce it. But this is not the case, because
IE is only defined for Horn logic programs and $B$ in the example is not one.
In other words, $q$ is not entailed by $B$, so it cannot be in the body of
$H$. This last fact is also the reason why $H_2$ is not discovered by NM-IE (even though
NM-IE is defined for non-Horn programs).

Example 2. Consider another normal logic program $B$

$q \leftarrow \mathit{not}\ p$
$q \leftarrow \mathit{not}\ q$

and suppose we want to learn $E = \{p\}$. Program $B$ has only one stable model, $\{q\}$. Thus the
literals that are consequences of $B$ are $\{q, \mathit{not}\ p\}$. □

Consider the following tentative solutions: 

$H_1 = \{p \leftarrow\}$, $H_2 = \{p \leftarrow q\}$, $H_3 = \{p \leftarrow \mathit{not}\ p\}$, $H_4 = \{p \leftarrow \mathit{not}\ p, q\}$.

Hypothesis $H_1$ is not a solution: it makes the program $B \cup H_1$ contradictory,
i.e. there is no stable model. In fact none of the four tentative hypotheses is a
solution, because for any of them $B \cup H_i$ is contradictory.

These are all the hypotheses built from the literals that are consequences of $B$. Is
there then no solution? Actually, there are solutions. Consider $H_5 = \{p \leftarrow,\ q \leftarrow p\}$.
The program $B \cup H_5 \models E$ and it is not contradictory; its unique stable model
is $\{p, q\}$.

Inverse Entailment (IE) jl] and Enlarged Inverse Entailment (E-IE) by Mug- 
gleton [S|, and Non-monotonic IE by Sakama [H], rely on the set of literals con- 
sequence of B to define the hypothesis H solution to induction. 

When extending induction to nonmonotonic LP, the background knowledge
$B$, which contains negative literals in the rules, is no longer representable by the
set of literals that are its consequences. In Example 2, an alternative $B = \{q \leftarrow \mathit{not}\ p\}$,
with the same set of consequences, $\{q, \mathit{not}\ p\}$, would accept $H_1$ as a solution.

There is another extension made to the basic setting of ILP in these examples, 
namely, the predicate of the examples can be already present in the rules of B. 

When this extension of $B$ is considered, the contribution of the background
knowledge to the learning task is not only to provide facts about other predicates
on which the induced rules can rely, but also to act as a constraint
that can forbid some solutions.

This fact is also present in basic ILP, e.g. the (extended) background knowledge
can already entail one negative example, making induction impossible.

Part of this study is centered on the identification of the constraint effect of
the background knowledge.

Finally there is another effect when $H$ is a normal program: some consequences
of $B$ may not be preserved after induction. Consider $B = \{p \leftarrow \mathit{not}\ q\}$, whose
consequences are $\{p\}$. If we want to learn $\{q\}$, then $H = \{q \leftarrow\}$ is a solution, but
now the consequences of $B \cup H$ are $\{q\}$, i.e. $p$ is no longer a consequence. These
are ‘nonmonotonic’ consequences of $B$, i.e. consequences that relied on default
assumptions about an atom being false.

In the next section we recall the definition of stable models. Then induction
of stable models is characterized. We conclude by discussing the results and
commenting on related work.

2 Normal Logic Programs and Stable Models 

A (ground) normal logic program is a set of rules of the form 

$A_0 \leftarrow A_1, \ldots, A_m, \mathit{not}\ A_{m+1}, \ldots, \mathit{not}\ A_n$ (1)

where $n \geq m \geq 0$, and each $A_i$ is a ground atom. If a rule or a program does
not contain the $\mathit{not}$ operator it is called positive.

The stable model semantics of a normal logic program is defined in two
steps. First let $B$ be a positive program; then the stable models are the minimal
sets of atoms $M$ that satisfy the condition: for each rule

$A_0 \leftarrow A_1, \ldots, A_m$

from $B$, if $A_i \in M$ for all $i : 1, \ldots, m$, then $A_0 \in M$.

Now let $B$ be a general ground program. For any set $M$ of atoms, let $B^M$ be
the program obtained from $B$ by deleting

1. each rule that has a formula $\mathit{not}\ A$ in its body with $A \in M$, and

2. all formulas of the form $\mathit{not}\ A$ in the bodies of the remaining rules.

The program $B^M$ is positive; if $M$ is the stable model of this program then $M$
is a stable model of $B$.
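The two-step definition above translates almost literally into code. The following Python sketch is an illustration only: encoding a ground rule as a `(head, positive_body, negative_body)` triple, the function names, and the brute-force enumeration over all subsets of atoms are assumptions made here (a real solver such as smodels works very differently).

```python
from itertools import combinations

# A ground rule is a hypothetical triple (head, positive_body, negative_body),
# e.g. ("p", (), ("q",)) stands for  p <- not q.

def reduct(program, m):
    """B^M: delete rules blocked by M, then drop the remaining 'not' literals."""
    m = set(m)
    return [(head, pos) for head, pos, neg in program if not (set(neg) & m)]

def least_model(positive_program):
    """Minimal set of atoms closed under the rules of a positive program."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, pos in positive_program:
            if set(pos) <= model and head not in model:
                model.add(head)
                changed = True
    return model

def is_stable_model(program, m):
    return least_model(reduct(program, m)) == set(m)

def stable_models(program):
    """Brute force: test every subset of the atoms occurring in the program."""
    atoms = sorted({a for h, pos, neg in program for a in (h, *pos, *neg)})
    return [set(c) for k in range(len(atoms) + 1)
            for c in combinations(atoms, k) if is_stable_model(program, c)]

# Program of Example 1:  p <- not q ;  q <- not p
B = [("p", (), ("q",)), ("q", (), ("p",))]
print(stable_models(B))
```

Running the sketch on the program of Example 1 yields the two stable models $\{p\}$ and $\{q\}$ mentioned in the introduction. The enumeration is exponential in the number of atoms, so it is only suitable for checking small examples by hand.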

Note that, by the definition, for positive programs the stable model is unique 
and coincides with the least Herbrand model of the program. This result also 
holds for Horn programs, as shown in 0, when the program is not contradictory; 
otherwise there is no least Herbrand model, nor stable model. Even for normal 
programs, when the least Herbrand model exists, the stable model coincides with 
it (and is unique), e.g. stratified normal programs. The difference arises for the other
normal programs, for which there can be no stable model, one, or several stable
models, e.g. $\{p \leftarrow \mathit{not}\ p, \mathit{not}\ q\}$ does not have a stable model; $\{p \leftarrow \mathit{not}\ q,\ q \leftarrow \mathit{not}\ p\}$
has two stable models, $\{p\}$ and $\{q\}$.

A program is not contradictory iff it has one or more stable models. When 
there are several stable models, the atoms consequence of the program are the 
atoms common to all the stable models. 

Stable models (and least Herbrand models) are minimal models; to further
differentiate a stable model from (just) a model, i.e. one that is not necessarily minimal, we
will call the latter a monotonic model of the program.

A set of atoms $M$ is a monotonic model of a rule (1) iff whenever $M$ satisfies
the body, $A_i \in M$ for all $i : 1, \ldots, m$, and $A_j \notin M$ for each $j : m+1, \ldots, n$,
then $M$ satisfies the head, $A_0 \in M$.

A set of atoms M is a monotonic model of a program iff it is a monotonic 
model of each rule of the program. 
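The monotonic model test is a plain satisfaction check and needs no fixpoint computation. A minimal Python sketch follows; the `(head, positive_body, negative_body)` encoding of a ground rule and the name `is_monotonic_model` are assumptions made for illustration.

```python
def is_monotonic_model(program, m):
    """True iff M satisfies every rule: whenever the body holds in M, so does the head."""
    m = set(m)
    for head, pos, neg in program:
        body_holds = set(pos) <= m and not (set(neg) & m)
        if body_holds and head not in m:
            return False
    return True

# Program of Example 2:  q <- not p ;  q <- not q
B = [("q", (), ("p",)), ("q", (), ("q",))]
print(is_monotonic_model(B, {"p"}))        # {p} violates  q <- not q
print(is_monotonic_model(B, {"p", "q"}))   # both bodies are false, so all rules hold
```

For the program of Example 2 the set $\{p\}$ fails the test while $\{p, q\}$ passes it.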

It is easy to verify that every stable model is a monotonic model of the
program.

3 Characterization of Induction in Stable Models

In this section we propose necessary and sufficient conditions for the existence of
a solution to induction of normal logic programs under the stable models semantics.
We do the characterization in three steps: first for induction from a complete
set of examples, then for induction in the usual ILP setting, in which the set of
examples is not complete, and finally for induction from several sets of examples, a
new ILP setting that is relevant in stable models programming.



3.1 Induction from Complete Sets 

Consider the particular ILP setting in which the set of examples is complete, i.e. 
for every ground literal of the program there is either a positive example on it or 
a negative example. In this case, the set of examples corresponds to one model 
of the program. 

In this setting, the task of finding an extension of B that entails the set 
of examples is an application of the representation theorem for LP. Instead of 
finding a program that has a particular set of facts as consequence, it is to find 
an extension of a given program B that has the particular set as consequence. 
(Without B there is a simple solution, viz., a set of fact rules, one for each 
positive example.) But with B — as mentioned before — there can be no solution, 
e.g. when B already entails one of the negative examples. 

When $B$ is a normal program its behavior as a constraint on the solutions
is stronger. As shown in Example 2 in the introduction, even a program $B$ that
does not entail negative examples does not necessarily accept the simple solution of a set
of facts on the positive examples.

Theorem 1. (Existence of solution, necessary condition) Given a normal
logic program $B$ and a possible model $M$, there is no extension $H$ of $B$ such
that $M$ is a stable model of $B \cup H$ if $M$ is not a monotonic model of $B$.

Proof. Every stable model of a program is a monotonic model of it. The addition
of more formulas $H$ to a given program $B$ can only delete monotonic models of $B$
(they have to satisfy $H$ as well). Thus if $M$ is not a monotonic model of $B$,
adding more rules will not recover $M$ as a monotonic model. □

Note that if we replace monotonic model by stable model in the previous
proof, the argument no longer holds: adding rules can create new stable models,
which is precisely what makes the formalism nonmonotonic.

Given a complete set of examples $E = \{a_1, \ldots, a_n, \mathit{not}\ b_1, \ldots, \mathit{not}\ b_m\}$, it
directly corresponds to one possible model $M = \{a_1, \ldots, a_n\}$ of the program
$B$. Thus if there is a solution $H$, $B \cup H \models E$, then the set $E$ considered as a model,
$M$, is a monotonic model of $B$.

Next we will show that the converse of Theorem 1 also holds, providing a
complete characterization of the existence of a solution to induction problems for
a complete set of examples, under the stable model semantics.

Theorem 2. (Existence of solution, sufficient condition) Given a normal
logic program $B$ and a possible model $M$, there is an extension $H$ of $B$ such
that $M$ is a stable model of $B \cup H$ if $M$ is a monotonic model of $B$.

Proof. We will construct an $H$ that is a solution. Consider $H = \{a_i \leftarrow \mid a_i \in M\}$,
a set of fact rules corresponding to each of the positive atoms $a_i \in M$. Note
that, by construction, $M$ is a monotonic model of $H$. (Each of the rules $a_i \leftarrow$
constructed from $M$ is satisfied by it.) As $M$ is a monotonic model of $B$,
it is a monotonic model of $B \cup H$.

Then we will verify that it is stable. Consider the reduct $(B \cup H)^M = B^M \cup H^M
= B^M \cup H$, because $H$ is a positive program. $M$ is a stable model iff it is the
minimal model of the positive program $B^M \cup H$. As $M$ is a monotonic model
of $B$, it is a monotonic model of $B^M$ (the reduct $B^M$ has a subset of the rules
of $B$, and for the remaining rules the negative literals deleted are satisfied by
$M$). Assume $M$ is not minimal; then there is another (monotonic) model $M'$ of
$B^M \cup H$ such that $M' \subset M$. Then there is an atom, say $a_k$, with $a_k \in M$ and
$a_k \notin M'$. By construction of $H$ there is a fact rule $a_k \leftarrow$. Then $M'$ is not a
monotonic model of this fact rule, thus it is not a monotonic model of $B^M \cup H$,
a contradiction. □



In fact, there are other H solutions when M is a monotonic model of B. But 
the H formed with facts is always a solution. 

Example 2 (cont.) Theorem 2 may seem surprising if we recall Example 2 in the
introduction. In that example it is shown that the simple $H$ formed with facts is
not a solution and, nevertheless, there are other solutions. The key observation
is that $\{p\}$ in Example 2 is not a monotonic model of $B$. Thus this model, which
entails the example set $E = \{p\}$, is not valid. The solution shown corresponds to
the model $\{p, q\}$, and this one is a monotonic model of $B$.

There is a solution in Example 2 because $E$ is not considered a complete set
(in the sense that $\mathit{not}\ q$ is not in the set). (If the example set
$\{p\}$ were complete, i.e. $\{p, \mathit{not}\ q\}$, then by the previous result we would conclude
that there is no solution.) □

The previous characterization of the existence of solution for induction of sta- 
ble models, needs to detect whether a model is a monotonic model of a program 
or not. 

Theoretically a model is a monotonic model of a program iff the model satis- 
fies the program. Thus the model corresponding to the complete set of examples 
can be tested for satisfiability, constituting an implementation of induction in 
stable models. 

Alternatively, the following result can be used to verify that a set is a
monotonic model. We will use the same name to denote a set of atoms
$M = \{a_1, \ldots, a_n\}$ and a set of fact rules on the atoms of the set,
$M = \{a_1 \leftarrow, \ldots, a_n \leftarrow\}$.

Proposition 1. Given a normal logic program $B$, $M$ is a monotonic model of $B$
iff $M$ is a stable model of $B \cup M$.






Ramon P. Otero 



Proof. Consider the program $B \cup M$. If $M$ is a stable model of it then it is a
monotonic model of $B \cup M$. Thus it has to be a monotonic model of both $B$ and
$M$.

The proof in the other direction is similar to that of Theorem 2. □

Proposition 1 identifies the monotonic models $M$ of $B$ with those sets that are
stable models of $B \cup M$. In fact, under the conditions of Proposition 1, $M$ is the
unique stable model of $B \cup M$. (Any other stable model $M'$ of $B \cup M$ will satisfy
$M$, thus $M \subset M'$ and then $M'$ is not stable.)

System smodels by Niemela et al. [6] is a sound and complete implementation 
to find the stable models of a normal program. 

The system smodels can be used to induce normal programs under the stable
models semantics for the setting of complete sets of examples. Just consider the
example set $E$ as a possible model $M$ and try the program $B \cup M$: if $M$ is a stable model
of it, then there is a solution to induction, e.g. $H = M$ is a solution. If $M$ is not
a stable model of it, then there is no solution.

Examples 1 and 2 (cont.) Recall Example 1 in the introduction. Consider $B \cup E =
\{p \leftarrow \mathit{not}\ q,\ q \leftarrow \mathit{not}\ p,\ p \leftarrow\}$; the stable model is $\{p\}$. Then $H = \{p \leftarrow\}$ is a
solution.

For Example 2 consider $B \cup E = \{q \leftarrow \mathit{not}\ q,\ q \leftarrow \mathit{not}\ p,\ p \leftarrow\}$; there is no
stable model. Then there is no solution.

Instead consider the example set $E' = \{p, q\}$: $B \cup E'$ has $E'$ as stable model,
then $H = \{p \leftarrow,\ q \leftarrow\}$ is a solution. □
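The test for complete example sets can be reproduced on these examples with a small Python sketch. It does not call smodels; instead it checks stability of $M$ in $B \cup M$ directly from the definition. The `(head, positive_body, negative_body)` rule encoding and all function names are assumptions made for illustration.

```python
def least_model_of_reduct(program, m):
    """Least model of the reduct of `program` with respect to the set of atoms m."""
    m = set(m)
    positive = [(h, pos) for h, pos, neg in program if not (set(neg) & m)]
    model, changed = set(), True
    while changed:
        changed = False
        for h, pos in positive:
            if set(pos) <= model and h not in model:
                model.add(h)
                changed = True
    return model

def facts(m):
    """The set of atoms M read as a set of fact rules."""
    return [(a, (), ()) for a in sorted(m)]

def solution_for_complete_set(b, m):
    """Return H = M (as fact rules) if M is a stable model of B u M, else None."""
    h = facts(m)
    return h if least_model_of_reduct(b + h, m) == set(m) else None

B1 = [("p", (), ("q",)), ("q", (), ("p",))]   # Example 1
B2 = [("q", (), ("p",)), ("q", (), ("q",))]   # Example 2

print(solution_for_complete_set(B1, {"p"}))        # fact rule for p
print(solution_for_complete_set(B2, {"p"}))        # no solution
print(solution_for_complete_set(B2, {"p", "q"}))   # fact rules for p and q
```

The three calls mirror the three cases discussed above: a solution for Example 1, no solution for Example 2 with $\{p\}$ taken as complete, and a solution for the example set $\{p, q\}$.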



3.2 Induction from Non-complete Sets 

Consider that the set of examples is not complete, which is the usual ILP setting. The
definition of solution to an induction problem is as follows.

Given the two parts of the set of examples $E = E^+ \cup E^-$, i.e. the positive
examples $E^+$ and the negative examples $E^-$, there is a solution $H$, in the presence
of background knowledge $B$, iff $B \cup H \models E^+$, $B \cup H \not\models E^-$, and $B \cup H \not\models \bot$.

Definition 1 (Complete extension) Given a set of examples $E = E^+ \cup E^-$,
an interpretation $M$ (of $B \cup E$) is a complete extension of $E$ iff $E^+ \subseteq M$ and
$M \cap E^- = \emptyset$. □

The following result identifies the existence of solution using the results for 
the case of complete set of examples. 

Theorem 3. (Existence of solution) Given a normal logic program $B$ and
a set of examples $E = E^+ \cup E^-$, there is a solution $H$ to induction iff there is
(at least) one complete extension $M$ of $E$ that is a monotonic model of $B$.

Proof. If there is a solution then the stable model $M$ of $B \cup H$ exists (because
$B \cup H \not\models \bot$), thus it is a monotonic model of $B$. Furthermore, $M$ is a complete
extension of $E$, because $B \cup H \models E^+$, thus $M \models E^+$ ($E^+ \subseteq M$), and $B \cup H \not\models E^-$,
thus $M \not\models E^-$ ($M \cap E^- = \emptyset$). (Note that the last relation means that $M \not\models e^-$
for every $e^- \in E^-$.)

If there is a monotonic model $M$ of $B$ that is a complete extension of $E$,
then by Theorem 2 there is an extension $H = M$ such that $M$ is a stable model
of $B \cup H$. Then $B \cup H \not\models \bot$. As $M$ is a complete extension of $E$, $M \models E^+$ and
$M \not\models E^-$. Thus $B \cup H \models E^+$ and $B \cup H \not\models E^-$. (Recall that, in fact, $M$ is the
unique stable model of $B \cup M$.) □

As in the complete set case, an alternative characterization can be given.
From the previous theorem and Proposition 1
we get the following.

Corollary 1. Given a normal logic program $B$ and a set of examples $E =
E^+ \cup E^-$, there is a solution $H$ to induction iff there is (at least) one complete
extension $M$ of $E$ that is a stable model of $B \cup M$. □

From an implementation point of view, it is now necessary to search the
extensions of the set of examples for a complete extension that is a stable model of
itself added to $B$.

The direct implementation is to call the system smodels repeatedly, each time with $B$
and one of the complete extensions, until one of them has itself as stable model.
(If none of them is its own stable model then there is no solution.)
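The search over complete extensions can be sketched as follows. Instead of invoking smodels on each $B \cup M$, this Python illustration uses the equivalent monotonic model test (Theorem 3 together with Proposition 1). The rule encoding, the explicit `atoms` universe, the function names, and the smallest-first enumeration order are all assumptions made here.

```python
from itertools import combinations

def is_monotonic_model(program, m):
    """True iff every rule whose body holds in m also has its head in m."""
    m = set(m)
    return all(head in m
               for head, pos, neg in program
               if set(pos) <= m and not (set(neg) & m))

def find_complete_extension(b, atoms, e_pos, e_neg):
    """Smallest complete extension M of (E+, E-) that is a monotonic model of B.

    Returns None when no complete extension qualifies, i.e. no solution exists.
    """
    e_pos, e_neg = set(e_pos), set(e_neg)
    free = sorted(set(atoms) - e_pos - e_neg)   # atoms the examples leave open
    for k in range(len(free) + 1):
        for extra in combinations(free, k):
            m = e_pos | set(extra)
            if is_monotonic_model(b, m):
                return m
    return None

# Example 2:  q <- not p ;  q <- not q,  with E+ = {p}
B = [("q", (), ("p",)), ("q", (), ("q",))]
print(find_complete_extension(B, {"p", "q"}, {"p"}, set()))    # extension containing p and q
print(find_complete_extension(B, {"p", "q"}, {"p"}, {"q"}))    # None: q is forbidden
```

Once an extension $M$ is found, $H = M$ read as fact rules is a solution. Enumerating by increasing size returns a minimally extended solution first, which is one possible choice among the admissible extensions.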

Note that in this setting we have a choice on the possible solutions. Several 
extensions of the set of examples can have solution. This is the usual choice in 
induction from the most specific solution to the most general solution. 

In this setting, the search can be reduced using the following result. 

Proposition 2. Given a normal logic program $B$, $M'$ is a monotonic model of $B$
if $M'$ is a stable model of $B \cup M$ and $M \subseteq M'$.

Proof. Consider that we add some atoms of $M'$, $M \subseteq M'$, to $B$, and $M'$ is a stable
model of $B \cup M$. Then $M'$ is a monotonic model of $B \cup M$, thus it is also a
monotonic model of $B$. □

Consider a set of examples $E = E^+ \cup E^-$. We extract from $E$ the subset of
positive examples $M = E^+$. Sometimes $B \cup M$ does not have $M$ as stable model
but some superset $M'$ as stable model. Then if $M'$ is a complete extension of $E$,
there is a solution to induction.

Even when $B \cup M$ does not have a stable model at all, there can be a solution
to induction. The situation in nonmonotonic induction is that $B \cup E^+$ can be
contradictory and still an extension of $E^+$ may provide a consistent extension for $B$.
Thus we have to search for a consistent extension of $B$, and among all these
extensions we can choose following particular generalization criteria.

Note that these results not only characterize the existence of a solution to
induction, but every $H$ that is a solution to induction. A particular $H$ is a solution
iff the model of $B \cup H$ is a monotonic model of $B$ and a complete extension of
$E$.

In summary, there are three kinds of solutions in this setting of normal logic
programs.

• Minimally extended. The $H$ that are facts on the positive examples, minimally
extended to avoid contradiction with $B$.

• Generalizations. The $H$ that are generalizations of the minimal ones, thus
implying more atoms. These solutions can be constructed by just adding
more fact rules to the minimal ones or, as usual in ILP, by first-order
generalization. But not every extension is a solution: these extended $H$ have
to verify the conditions of the characterization, in particular the monotonic
model condition (apart from the usual condition on complete extension).

• Nonmonotonic. There is a new kind of H solution in this setting (also present 
in NM-IE m) , viz., H that use negation as failure, let us call them nonmono- 
tonic hypotheses. 

The nonmonotonic hypotheses do not really constitute solutions more specific 
than the minimally extended ones. 

Furthermore, these nonmonotonic solutions have a property that is usually
unintended: the examples learned are not necessarily preserved after further
induction (induction in several steps, or multiple predicate learning). The
nonmonotonic behavior of $B \cup H$ is stronger than the one with minimally extended
or generalized hypotheses, because part of the examples entailed by $B \cup H$ might
rely on default assumptions.

Consider that we want to further extend $B \cup H$ with $H'$ to cover additional
examples. If the task is performed by considering $B' = B \cup H$ and applying
the basic procedure to arrive at $B' \cup H'$, then some of the previous examples
covered by $H$ can become uncovered after the addition of $H'$. (The coverage
of the previous examples has been done nonmonotonically, thus they are not
necessarily entailed after any addition of more rules to the program.) On the
other hand, any $H$ composed of fact rules entails the examples monotonically,
thus this situation cannot arise. (What can happen is the alternative situation,
that some of the negative examples are covered after the addition of $H'$, but this
is a well known fact already in ILP for Horn theories.)



Induction from Non-complete Sets under Background Horn Theories. 

So far we have shown that solutions composed of a collection of fact rules char- 
acterize the existence of solution in ILP for normal logic programs. 

The extension made here only points out that precisely the set of facts from 
the positive examples does not need to be such a solution. Nevertheless, there is 
a restriction that precisely characterizes the existence of solution with the set of 
facts from the positive examples. 

Consider that the background knowledge is a Horn theory, i.e. definite clauses 
and goal clauses (constraints), thus no clause contains the not operator. 

Then the following result holds. 

Theorem 4. (Existence of solution) Given a Horn logic program $B$ and a
(consistent) set of examples $E = E^+ \cup E^-$, there is a solution $H$ to induction iff
$H = E^+$ is a solution.

Proof. We only need to prove one direction. Obviously, if $H = E^+$ is a solution,
there is a solution.

Assume there is a solution $H'$. Then we will verify that $H = E^+$ is also a
solution. We will use the monotonic properties of Horn logic programs.

Consider that the stable model of $B \cup E^+$ exists, thus $B \cup E^+ \not\models \bot$. (Furthermore
it is unique ( [^, Lemma 1) and coincides with the least Herbrand model.)
Now recall that Horn programs verify monotonic properties, thus $B \cup E^+ \models E^+$,
simply because $E^+ \models E^+$. Finally $B \cup E^+ \not\models E^-$ because there is a solution $H'$
such that $B \cup H' \not\models E^-$ and $B \cup H' \models E^+$, thus we can add the consequences,
$B \cup H' \cup E^+ \not\models E^-$, and remove the hypothesis, $B \cup E^+ \not\models E^-$.

Consider that there is no stable model of $B \cup E^+$. But if there is another
solution, $B \cup H' \models E^+$ and $B \cup H' \not\models \bot$. Applying the monotonic properties of
Horn programs, $B \cup H' \cup E^+ \models E^+$ without becoming contradictory (adding
consequences of $B \cup H'$ does not change its least model), thus $B \cup E^+ \not\models \bot$. Then
there is a stable model of $B \cup E^+$ if there is any solution $H'$. □



3.3 Induction from Several Sets of Examples 

Under stable models semantics, normal logic programs constitute a new declara- 
tive programming paradigm. The idea relies on the fact that logic programs can 
have no stable model, one or several stable models. 

Each stable model is associated with one (alternative) solution to the prob- 
lem described by the program. Thus when there are several stable models, the 
problem has several solutions; and when there is no stable model, the problem 
does not have a solution. 

Typical problems of this kind are combinatorial problems, e.g. finding the 
different ways the nodes of a graph can be colored verifying that no adjacent 
nodes have the same color. Other typical examples are planning problems, i.e. 
finding the sequence of actions that lead to a given goal state from a given initial 
state of the domain. 

For this kind of application of stable models programming, induction would
be welcome. To this end the usual setting of ILP has to be extended. The direct
extension is to consider several sets of examples, each one corresponding to an
intended solution to the problem.

Definition 2 (Induction of (several) stable models) Given a logic program
$B$ and several sets of examples $E_1, \ldots, E_n$ (each one composed of two
parts, $E_i = E_i^+ \cup E_i^-$), there is a solution program $H$ to induction iff for each set
$E_i$, $i : 1, \ldots, n$, there is a stable model $M_i$ of $B \cup H$ such that $M_i \models E_i^+$ and
$M_i \not\models E_i^-$ ($M_i \cap E_i^- = \emptyset$). □

Note that the usual definition of induction in ILP is the particular case for 
a unique set of examples. 

The previous results are still useful to characterize induction of several stable
models. First we need the concept of antichain and a result about it (similar
to one by V. Marek and M. Truszczynski in 0 ).

A collection of sets of atoms $\{M_i, i : 1, \ldots, n\}$ forms an antichain iff whenever
$M_i \subseteq M_j$ then $M_i = M_j$, for every $i, j : 1, \ldots, n$. Thus no set is a subset of
another set in the collection.
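The antichain condition is easy to check mechanically. A minimal Python sketch, in which the function name `is_antichain` and the use of `frozenset` are choices made here for illustration:

```python
def is_antichain(collection):
    """True iff, for all distinct positions i and j, M_i subset-of M_j implies M_i == M_j."""
    sets = [frozenset(s) for s in collection]
    return all(not (a <= b) or a == b
               for i, a in enumerate(sets)
               for j, b in enumerate(sets)
               if i != j)

print(is_antichain([{"p"}, {"q"}]))        # neither set contains the other
print(is_antichain([{"p"}, {"p", "q"}]))   # {p} is a proper subset of {p, q}
```

The check is quadratic in the number of sets, which is unproblematic for the small collections of intended models considered here.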

Proposition 3. Given a normal logic program $B$ and a collection of monotonic
models $\{M_i, i : 1, \ldots, n\}$ of $B$ that form an antichain, there is an
extension $H$ such that $B \cup H$ has the models $M_i$, $i : 1, \ldots, n$, as stable models
(simultaneously).

Proof. We have to propose a set of rules $H$ that, when added to $B$, makes all the
$M_i$, $i : 1, \ldots, n$, stable.

One possibility is to add an $H_i$ to make each $M_i$ stable. But each of these
$H_i$ has to be carefully chosen so as not to forbid the other intended stable models $M_j$. This
will be achieved if $H_i$ is able to make $M_i$ stable while keeping all the monotonic
models of $B$ (except subsets of $M_i$). Then we have to add to $B$ rules that are
violated only by subsets of $M_i$, while any other set (not a subset of $M_i$) satisfies
the rules in $H_i$.

Consider $H_i = \{a_i \leftarrow NB \mid a_i \in M_i\}$ where $NB = \mathit{not}\ b_1, \ldots, \mathit{not}\ b_m$ for all
$b_j \notin M_i$. Each $H_i$ is a set of rules, one for each positive atom in $M_i$ as head,
and the same body for all of them: the conjunction of the negative literals for
the atoms not in $M_i$. Then every proper subset of $M_i$ does not have any $b_j$ and also
lacks some positive atom $a_k$ in $M_i$. Thus it does not verify the rule in $H_i$
corresponding to $a_k$.

For any other model that is not a subset of $M_i$, there is an $a_r$ in it that is not
in $M_i$; thus the model does not verify $\mathit{not}\ a_r$, which is in every body of the rules
of $H_i$, and thus the model satisfies all the rules of $H_i$. Then $H_i$ does not delete any
other model, except subsets of $M_i$.

We show that $H_i$ makes $M_i$ stable. Consider the reduct $(B \cup H_i)^{M_i}$. It is equal
to $B^{M_i} \cup M_i$, because $H_i^{M_i} = M_i$ (all the $b_j$ are not in $M_i$, thus all those literals
are deleted in the reduct $H_i^{M_i}$, but the rules are kept as fact rules). (From this
point the proof of Theorem 2 can be directly followed.) As $M_i$ is a monotonic
model of $B$, it is a monotonic model of $B^{M_i}$. By Proposition 1 it is a stable
model of $B^{M_i} \cup M_i$. Thus $M_i$ is a stable model of $B \cup H_i$.

Finally, the addition of the other $H_j$ to $B$ (for the other intended stable models) does
not interfere with each other if the collection $\{M_i, i : 1, \ldots, n\}$ forms an antichain.

Recall, as mentioned above, that any other model $M_j$ that is not a subset
of a given $M_i$ satisfies the rules in $H_i$, because there is one $a_r \in M_j$ that is present
in the body of the rules of $H_i$ as $\mathit{not}\ a_r$. Thus $H_i^{M_j} = \emptyset$ for every $i, j : 1, \ldots, n$
with $i \neq j$. Then the reduct $(B \cup H)^{M_i} = B^{M_i} \cup H_1^{M_i} \cup \ldots \cup H_n^{M_i} = B^{M_i} \cup H_i^{M_i}$.
Thus every $M_i$ is a stable model of $B \cup H$. □



Theorem 5. (Existence of solution) Given a normal logic program $B$ and
several sets of examples $E_1, \ldots, E_n$, there is a solution $H$ to induction iff

i) for each set $E_i$ there is (at least) one complete extension $M_i$ of $E_i$ that is a
monotonic model of $B$, and

ii) the set of complete extensions $\{M_i, i : 1, \ldots, n\}$ forms an antichain.

Proof. The antichain condition is needed because the collection of stable models
of a program always forms an antichain. (Recall that if a model is stable, no proper subset
of it is (simultaneously) stable.)

The proof of this theorem follows that of Theorem 3.

If there is a solution $H$ then $B \cup H$ has a collection of stable models $\{M_i, i :
1, \ldots, n\}$, thus they form an antichain. Furthermore each $M_i$ is a monotonic
model of $B$. Finally, for each $E_i$ there is an $M_i$ that is a complete extension of
it, because $M_i \models E_i^+$ ($E_i^+ \subseteq M_i$), and $M_i \not\models E_i^-$ ($M_i \cap E_i^- = \emptyset$).

If there is a collection of monotonic models $\{M_i, i : 1, \ldots, n\}$ of $B$ that form
an antichain, then by Proposition 3 there is an extension $H$ such that $B \cup H$
has the models $\{M_i, i : 1, \ldots, n\}$ as stable models (simultaneously).

As each $E_i$ has one $M_i$ that is a complete extension of it, $M_i \models E_i^+$ and
$M_i \not\models E_i^-$. □

Note that the solution H with only facts is not, in general, a solution for 
several sets of examples, as it was in the other settings. 

This result shows that nonmonotonic hypotheses (i.e. with negation as failure 
in the body) are only truly needed when there are several sets of examples. 

As in the other settings, the existence of a solution does not mean that $H$
has to be exactly in the form we used for the proofs. Other solutions can exist,
as we showed before, but recall that they exist only when the conditions of
the result presented hold. In this sense, the collection $\{H_i, i : 1, \ldots, n\}$ proposed
can be thought of as the most specific solution to the problem, and also the most
conservative solution (in the sense that it keeps as many monotonic models of
the extended program as possible).

From an implementation point of view, we can use the results on the other
settings. Notice that induction from several sets can be done separately for each
set, that is, as a case of induction from a non-complete set. The only difference is
that instead of using $H = M$, a set of facts, we have to test with the rules
$H_i = \{a_i \leftarrow not\ b_1, \ldots, not\ b_m \mid a_i \in M_i\}$, where $b_1, \ldots, b_m$ are all the $b_j \notin M_i$.

Example 3. Given the normal logic program $B$

$p \leftarrow not\ q$

Assume we want an extension of this program, $B \cup H$, for the atoms of the sets
$E_1 = \{p\}$ and $E_2 = \{q\}$ to be consequences of corresponding stable models of the
extension. Note that $B$ has one stable model, $\{p\}$.

Consider the complete extensions $M_1 = E_1^+$ and $M_2 = E_2^+$. They form an
antichain collection. They are monotonic models of $B$. Thus there is a solution.
Build $H_1 = \{p \leftarrow not\ q\}$ and $H_2 = \{q \leftarrow not\ p\}$. Then $B \cup H_1 \cup H_2 = \{p \leftarrow
not\ q,\ q \leftarrow not\ p\}$ indeed has $M_1$ and $M_2$ as stable models. □
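
The claim of Example 3 can be checked mechanically on such a small program. The following is a minimal sketch of a brute-force stable model test based on the Gelfond-Lifschitz reduct; the rule encoding (head, positive body, negative body) and the function names are assumptions made for this illustration, not part of the original work.

```python
from itertools import chain, combinations

# A rule is (head, positive_body, negative_body); atoms are plain strings.

def least_model(definite_rules):
    """Least model of a negation-free program, by naive forward chaining."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, pos in definite_rules:
            if head not in model and set(pos) <= model:
                model.add(head)
                changed = True
    return model

def is_stable(program, candidate):
    """Candidate is stable iff it equals the least model of the reduct."""
    # Reduct: drop rules blocked by the candidate, then drop the 'not' literals.
    reduct = [(h, pos) for h, pos, neg in program if not (set(neg) & candidate)]
    return least_model(reduct) == candidate

def stable_models(program):
    """Enumerate all subsets of atoms and keep the stable ones."""
    atoms = sorted({a for h, pos, neg in program for a in (h, *pos, *neg)})
    subsets = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1))
    return [set(s) for s in subsets if is_stable(program, set(s))]

B = [("p", (), ("q",))]    # p <- not q
H1 = [("p", (), ("q",))]   # p <- not q
H2 = [("q", (), ("p",))]   # q <- not p

print(stable_models(B))            # [{'p'}]
print(stable_models(B + H1 + H2))  # [{'p'}, {'q'}]
```

The enumeration is exponential in the number of atoms, so it is only meant for checking toy examples of this size.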

Example 4. Consider a simple graph with two nodes n(1), n(2), connected by
an arc a(1,2). We want to find the different ways the nodes of a graph can be
colored with two colors. We represent white nodes by w(X); all the
other nodes, for which w(X) is false, are black.

Background knowledge $B$ is the graph, and the undirected condition of the
graph

n(1) ←
n(2) ←
a(1,2) ←
a(X,Y) ← a(Y,X), n(X), n(Y)

The sets of examples inform on possible solutions, $E_1 = \{w(1), not\ w(2)\}$ and
$E_2 = \{w(2)\}$. Notice that the union $E_1 \cup E_2$ is contradictory, and also that $B$
has a unique stable model $M_B = \{n(1), n(2), a(1,2), a(2,1)\}$.

Consider the complete extensions $M_1 = E_1^+ = \{w(1)\}$ and $M_2 =
E_2^+ = \{w(2)\}$. They form an antichain collection. But they are not
monotonic models of $B$. Consider the complete extensions $M_1 = E_1^+ \cup
M_B = \{w(1), n(1), n(2), a(1,2), a(2,1)\}$ and $M_2 = E_2^+ \cup M_B =
\{w(2), n(1), n(2), a(1,2), a(2,1)\}$. They form an antichain collection and they
are monotonic models of $B$. Thus there is a solution.

Build $H_1 = \{w(1) \leftarrow not\ w(2)\}$ and $H_2 = \{w(2) \leftarrow not\ w(1)\}$. (We
consider here only the rules of $H$ whose head is the predicate of the examples; the others
can also be added to $B$.) Then $B \cup H_1 \cup H_2$ indeed has $M_1$ and $M_2$ as stable
models.

There are other solutions. Consider for instance $H'_1 = \{w(1) \leftarrow
not\ w(2), a(1,2)\}$ and $H'_2 = \{w(2) \leftarrow not\ w(1), a(2,1)\}$. (Both can be gener-
alized to $H'' = \{w(X) \leftarrow not\ w(Y), a(X,Y)\}$.) They correspond to the same
monotonic models as $H_1$ and $H_2$. We just added one of the monotonic conse-
quences of $B$ to the body of the hypotheses, an addition that can always be
made without affecting the stable models of a program. □

4 Discussion and Related Work 

The results shown do actually apply to other ILP settings, as far as they use LP 
semantics for which stable models is a conservative extension. 

This characterization can be understood as a basis on which alternative tech- 
niques for induction can be defined. For example, it would be interesting to find 
more efficient characterizations, in the line E-IE 0 or NM-IE jS] work on other 
settings, to reduce the search for solutions. Besides, most of the work in ILP on
identifying the most general solution, or on other criteria for preferred solutions, will
be worthwhile in this new domain.

This characterization extends the proposal of NM-IE [B], characterizing in- 
duction, in general, for normal logic programs, including, e.g. from contradictory 
background knowledge, contradictory hypothesis. It also clarifies some of the re- 
sults in |7j and |T]. 

Furthermore, we identify necessary and sufficient conditions for the existence
of a solution to induction in normal programs. Recall for instance that the con-
ditions of NM-IE hold on Example [H B has a unique stable model, $H$ has a unique
stable model, including the $H$ solution. But there is no $H$ by the theoretical method
(nor by the algorithm) in NM-IE because all of them lead to a contradictory
$B \cup H$. (Consider $B \cup \{not\ L\} = B \cup \{not\ p \leftarrow\}$: there is only one stable model (the
same as for $B$), $\{q\}$. The rules for which this stable model is a counter-model are the four tentative
solutions in that example.)

Finally, induction from several sets of examples is defined and characterized. 
Abstract. Since learning with Inductive Logic Programming (ILP) can
be regarded as a search problem through the hypothesis space, it is
essential to reduce the search space in order to improve efficiency.
In the propositional learning framework, an efficient admissible search
algorithm called OPUS (Optimized Pruning for Unordered Search) has
been developed. OPUS employs effective pruning techniques for unordered
search and has succeeded in improving efficiency. In this paper,
we propose an application of OPUS to the ILP system Progol. However,
because of the difference in representation language, OPUS is not applicable
to ILP directly. We make clear the conditions under which the pruning
techniques in OPUS can be applied in the framework of Progol. In addition,
we propose a new pruning criterion, which can be regarded as
inclusive pruning. Experiments are conducted to assess the effectiveness
of the proposed algorithms. The results show that the proposed algorithms
reduce the number of candidate hypotheses to be evaluated as
well as the computational time for a certain class of problems.



1 Introduction 

Inductive Logic Programming (ILP) p(7l6l20| employs predicate logic as its rep- 
resentation language, so that it can handle structural data and use background 
knowledge, whereas propositional learners cannot, or can hardly, do so. For these
reasons, ILP has recently been recognized as one of the most advanced technologies in
the area of Data Mining and Knowledge Discovery in Databases [TUB].

However, ILP usually needs enormous computational time to obtain the re- 
sults from a huge amount of data appearing in such problems as Data Mining. 
In order to resolve this drawback and to make ILP more practical, we need more 
efficient algorithms. Several kinds of techniques have been developed to overcome
this problem. These include pre- and post-pruning [l()lllll2], application
of Genetic Algorithm |2tiJ, probabilistic search [m], best-bound search in the branch
and bound method [24], sampling techniques [30I2B], integration with
database management systems |1 141 23], efficient hypothesis evaluation |2l3l25l22j,
parallel implementation [2lll5l9j, and so on.



C. Rouveirol and M. Sebag (Eds.): ILP 2001, LNAI 2157, pp. 206 42191 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 
Since learning with ILP can be regarded as a search problem, reduction of
the search space as well as efficient hypothesis evaluation is essential in order
to improve efficiency. Inverse Entailment | 18136] is one of the most effective
methods to reduce the search space. It computes the most specific hypothesis
(MSH) that bounds the search space before the search begins.

In the propositional learning framework, an efficient search algorithm called 
OPUS (Optimized Pruning for Unordered Search) [331341 has been developed and 
succeeded in improving the efficiency. Note that both OPUS and Progol[T^ are 
designed for the exhaustive and admissible search. 

In this paper, we propose an application of OPUS to Progol. We consider 
the applicability of pruning criteria in OPUS. Since it is not possible to apply 
them directly because of the difference of representation languages, we give the 
condition under which these criteria can be applied to Progol. Furthermore, we 
propose a new pruning criterion called negative cover neutral inclusion, which 
can be regarded as a kind of inclusive pruning. 

This paper is organized as follows. In Section 2 we give a brief summary of
OPUS and consider the relationship between OPUS and Progol. In Section 3 we
show the conditions under which the pruning techniques in OPUS can be applied
to Progol. In addition, we propose a new pruning criterion. Experimental results
are shown in Section 4, and we conclude the paper and discuss future work in
Section 5.

2 Hypothesis Search in OPUS and Progol 

In this section, we first give a brief summary of OPUS and then consider the 
similarity and difference of search in OPUS and that in Progol. 

2.1 Brief Introduction of OPUS 

OPUS (Optimized Pruning for Unordered Search) is a best-bound search
algorithm in the branch and bound method, which enables efficient admissible search
through the search space for unordered search. Unordered search refers to
search problems in which the order of application of the search operators, i.e.
refinement operators [27] in the case of ILP, is not significant. OPUS has succeeded
in improving efficiency in the propositional classification framework [34], and
recently it has been applied to association rule mining under some restricted
conditions |3H].

There are two versions of OPUS, $\mathrm{OPUS}^o$ and $\mathrm{OPUS}^s$. $\mathrm{OPUS}^o$ is designed
for finding the best hypothesis in the search space. On the other hand,
$\mathrm{OPUS}^s$ is designed for obtaining all hypotheses which satisfy some evaluation
criteria. In this paper, we focus on $\mathrm{OPUS}^o$ only. Henceforth we denote
$\mathrm{OPUS}^o$ as OPUS for the sake of simplicity.

We show a simplified version of the OPUS algorithm in Fig. 1. While the
original algorithm [34] contains five kinds of pruning techniques, this simplified
algorithm includes only the three which we apply to Progol in this paper.

1. Put the start node $s$ on a list OPEN of unexpanded nodes. Set $s.active$ to the
   set of all operators, $o_1, o_2, \ldots, o_n$. Set BEST, the best node examined so far, to $s$.

2. If OPEN is empty, return BEST.

3. Remove from OPEN the node $n$ that maximizes $optimistic(n)$.

4. Initialize to $n.active$ a set containing those operators that are still under
   consideration, called CUR.

5. Initialize to $\{\}$ a set of nodes, called NEW, that will contain the descendants of
   $n$ that are not pruned.

6. For every operator $o$ in $n.active$

   (a) Generate $n'$ by application of $o$ to $n$. Set $n'.op$ to $o$.

   (b) If $value(n') > value(BEST)$
       Set BEST to $n'$. Remove from OPEN all nodes $x$ such that
       $optimistic(x) < value(BEST)$. (application of optimistic pruning)

   (c) If $optimistic(n') > value(BEST)$ and $neg(n') \subset neg(n)$
       Add $n'$ to NEW.
       else
       Remove $n'.op$ from CUR. (application of optimistic pruning and negative
       cover neutral pruning)

7. For every node $n'$ in NEW

   (a) If there is another node $x$ in NEW such that $neg(x) \subseteq neg(n')$ and
       $pos(x) \supseteq pos(n')$
       Remove $n'$ from NEW. Remove $n'.op$ from CUR. (application of relative
       cover pruning)

8. For every node $n'$ in NEW, selecting each time the node that minimizes
   $optimistic(n')$,

   (a) Remove $n'.op$ from CUR.

   (b) If $optimistic(n') > value(BEST)$,
       Set $n'.active$ to CUR. Add $n'$ to OPEN.

9. Go to step 2.

$value(n)$ denotes the evaluation value of $n$, $optimistic(n)$ denotes the optimistic
value of $n$, and $neg(n)$ and $pos(n)$ denote the sets of negative and positive examples
covered by $n$, respectively.

Fig. 1. Simplified Version of OPUS Algorithm
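
The steps of Fig. 1 can be sketched in a propositional setting as follows. This is only an illustration under stated assumptions: hypotheses are conjunctions of conditions, the evaluation value is taken to be the number of covered positives minus covered negatives, and the optimistic value is the number of covered positives. The `Node` class and both functions are hypothetical choices made here, not the evaluation function of OPUS or Progol.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    literals: frozenset                        # conjunction of conditions
    active: set = field(default_factory=set)   # operators usable below this node
    op: object = None                          # operator that created the node

def opus(examples, operators):
    """examples: list of (frozenset of true conditions, is_positive)."""
    def pos(n):
        return {i for i, (f, y) in enumerate(examples) if y and n.literals <= f}
    def neg(n):
        return {i for i, (f, y) in enumerate(examples) if not y and n.literals <= f}
    def value(n):            # assumed evaluation function
        return len(pos(n)) - len(neg(n))
    def optimistic(n):       # upper bound on the value of any descendant
        return len(pos(n))

    start = Node(frozenset(), set(operators))
    open_list, best = [start], start                      # step 1
    while open_list:                                      # steps 2 and 9
        n = max(open_list, key=optimistic)                # step 3
        open_list.remove(n)
        cur, new = set(n.active), []                      # steps 4 and 5
        for o in sorted(n.active):                        # step 6
            child = Node(n.literals | {o}, op=o)
            if value(child) > value(best):                # 6(b)
                best = child
                open_list = [x for x in open_list
                             if optimistic(x) >= value(best)]
            if optimistic(child) > value(best) and neg(child) < neg(n):
                new.append(child)                         # 6(c)
            else:
                cur.discard(o)
        for c in list(new):                               # step 7
            if any(x is not c and neg(x) <= neg(c) and pos(x) >= pos(c)
                   for x in new):
                new.remove(c)
                cur.discard(c.op)
        for c in sorted(new, key=optimistic):             # step 8
            cur.discard(c.op)
            if optimistic(c) > value(best):
                c.active = set(cur)
                open_list.append(c)
    return best

examples = [(frozenset("ab"), True), (frozenset("abc"), True),
            (frozenset("a"), False), (frozenset("bc"), False),
            (frozenset("c"), False)]
print(sorted(opus(examples, {"a", "b", "c"}).literals))   # ['a', 'b']
```

Nodes are removed one at a time in step 7, so two nodes with identical covers do not prune each other; this tie-handling is a design choice of the sketch.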



The search efficiency of OPUS is achieved by several effective pruning tech- 
niques as well as the utilization of the operators to be applied to the current 
hypothesis. All of the pruning rules in OPUS, which include not only exclusive 
pruning but also inclusive pruning, are admissible for unordered search under 
some evaluation function m- Exclusive pruning prunes hypotheses which have 
some search operator, whereas inclusive one prunes those which do not have 
any search operator. OPUS gathers all of the possible search operators before 
search. During search, each hypothesis n maintains a set of operators n.active 
that can be applied for further specialization in the search space below n. By 
utilizing the operators in n.active, OPUS realizes the unordered search while 
avoiding the duplicated generation of the same hypotheses. Furthermore, OPUS 
uses these operators to reconstruct the search space. Hypotheses which have a lower
optimistic value have a higher probability of being pruned, where the optimistic value is
the upper bound of the evaluation value of the descendants of the hypothesis.
OPUS assigns more operators to those hypotheses having a lower optimistic value to
accelerate the pruning (step 8 in Fig. 1).

2.2 Comparison between OPUS and Progol 

Both of the search in OPUS and A*-like search in Progol are exhaustive and 
admissible searches based on branch and bound method. While OPUS gathers all 
possible search operators before search, Progol constructs MSH. This similarity 
suggests the applicability of OPUS to Progol. 

Other than the representation language, the main difference between OPUS
and Progol is how they traverse the search space. While OPUS employs unordered
search, Progol employs fixed-order search in which the order of the literals to be
added is predefined. In Progol, the order of the literals in the MSH is fixed before the
search and each candidate hypothesis has an indicate variable $k$. Progol adds the
$k$th literal in the MSH to the hypothesis having $k$ as its indicate variable, if possible.
The parent hypothesis replaces its indicate variable by $k + 1$, and the newly
generated hypotheses also have $k + 1$ as their indicate variable. In this way, all
combinations of literals in the MSH are considered.
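
The fixed-order traversal described above can be illustrated with a short sketch. It is a simplification under stated assumptions: the index `k` is 0-based, literals are opaque strings, and the "if possible" check (for instance, whether the variables of the new literal are linked to the clause) is omitted. The function name is hypothetical.

```python
def fixed_order_search(msh_body):
    """Enumerate every non-empty combination of MSH body literals once."""
    generated = []
    open_list = [((), 0)]          # (body literals chosen so far, index k)
    while open_list:
        body, k = open_list.pop(0)
        if k >= len(msh_body):
            continue               # no literal left to add below this node
        child = body + (msh_body[k],)
        generated.append(child)    # hypothesis obtained by adding the k-th literal
        open_list.append((body, k + 1))    # parent moves on to literal k+1
        open_list.append((child, k + 1))   # new hypothesis also starts at k+1
    return generated

msh = ["a(A)", "b(A,B)", "c(A)", "d(B)"]
hyps = fixed_order_search(msh)
print(len(hyps))                   # 15
```

Each non-empty subset of the four body literals is produced exactly once, which is why the result has 2^4 - 1 = 15 entries.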

We show an example of the search tree in Progol in Fig. 2. In this example,
a hypothesis “p(A):- a(A),b(A,B).” will be generated by adding b(A,B)
to p(A):-a(A). If the order of the literals in the MSH is different, the same hypothesis
will be generated in a different manner. For example, if the MSH were
p(A):-b(A,B),a(A),c(A),d(B)., then a(A) would be added to p(A):-b(A,B). This
observation shows that there are some parts of the search space of Progol in which the
order of literals to be added is not significant, and thus to which we can apply
OPUS's unordered search. In order to introduce unordered search to Progol,
as OPUS does, we associate each candidate hypothesis with a set of literals in
the MSH to be added for further refinement, instead of the indicate variable. In this
paper, we regard the MSH as a set of literals and do not consider variable-splitting
in order to keep the discussion simple. However, the introduction of variable-splitting
is straightforward.

We now consider the difference in pruning effects between unordered and
fixed-order search. Suppose that a hypothesis “p(A):- a(A).” is pruned by optimistic
pruning in Fig. 2. Optimistic pruning prunes the hypotheses whose
optimistic value is less than or equal to the evaluation value of the best hypothesis.
In this case, Progol prunes all hypotheses which are more specific than
p(A):-a(A). Now, suppose that a hypothesis “p(A):-c(A).” is pruned. In this
case, no hypothesis except p(A):-c(A) itself is pruned, even if there are some
hypotheses which are more specific than p(A):-c(A). This difference depends on
the order of literals in the MSH; thus the order is important for effective pruning
in Progol. However, it is difficult to know a priori which literal is useful for
pruning.

OPUS realizes the pruning by simply removing an operator from the operators
of hypotheses, so that the effects of pruning are basically independent of
the order of application of search operators. By introducing unordered search to
Progol, we can incorporate the effective pruning mechanism into Progol, which
is basically independent of the order of literals in MSH.

[Figure: search tree of Progol for the MSH below]

MSH: p(A):-a(A),b(A,B),c(A),d(B).

Fig. 2. An example of search tree of Progol

3 Application of OPUS’s Pruning to Progol 

In this section, we first introduce some terminology used later in this section. 
Then we show the conditions under which the two pruning criteria in OPUS, 
Negative Cover Neutral Pruning and Relative Cover Pruning, can be applied to 
Progol. A new pruning criterion is also discussed. 

Given a hypothesis $H$, $pos(H)$ and $neg(H)$ denote the sets of positive and
negative examples covered by $H$, respectively. $|H|$ denotes the number of literals
in the body of $H$. $f(H)$ is the evaluation value of $H$. $H.active$ denotes the set
of literals which $H$ maintains for further refinement. $V(\alpha)$ denotes the set of all
variables appearing in a set of literals $\alpha$. We treat a clause as a set of literals.
Given a clause $C$ and a literal $L \notin C$, $V^-(L, C)$ denotes $V(\{L\}) \setminus V(C)$, i.e. the
variables local to $L$. $Link(L, H)$ denotes the set of all variables in $H$ which appear in
the variable chains from the head of $H$ to $L$. For example,
$Link(s(D, E),\ p(A)\ \text{:-}\ q(A,B,C), r(B,D).) = \{A, B, D\}$.
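
The variable sets $V$ and $V^-$ can be illustrated with a small sketch. The literal encoding (a predicate name with a tuple of arguments, variables written as capitalised strings) is an assumption made here for illustration. $Link$ is left out because its computation depends on how variable chains are enumerated.

```python
def V(literals):
    """Set of all variables appearing in a collection of literals."""
    return {arg for _, args in literals for arg in args if arg[:1].isupper()}

def local_vars(L, C):
    """V^-(L, C): variables of literal L that do not occur in clause C."""
    return V([L]) - V(C)

# Clause p(A) :- q(A,B,C), r(B,D). treated as a set of literals, head included.
clause = [("p", ("A",)), ("q", ("A", "B", "C")), ("r", ("B", "D"))]
s = ("s", ("D", "E"))

print(sorted(V(clause)))       # ['A', 'B', 'C', 'D']
print(local_vars(s, clause))   # {'E'}
```

For the clause of the example above, the only variable local to s(D,E) is E, since D already occurs in r(B,D).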

In this paper, we assume that the evaluation function satisfies the restricted
monotonicity.

Definition 1. ([24]) Let $H_1$ and $H_2$ be candidate hypotheses, and let $p_i$ and $n_i$ be the
numbers of positive and negative examples covered by $H_i$, respectively. Let $c_i$
be the number of literals in the body of $H_i$. Then the evaluation
function $f$ is said to satisfy the restricted monotonicity if the following condition holds:
if $p_2 \le p_1$, $n_1 \le n_2$, and $c_1 \le c_2$, then $f(H_1) \ge f(H_2)$.
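
Definition 1 can be checked exhaustively on small counts. The sketch below assumes a hypothetical score $f(p, n, c) = p - n - c$, which resembles a compression-style measure but is not claimed to be Progol's exact evaluation function. The checker name and the range limit are also assumptions.

```python
from itertools import product

def f(p, n, c):
    """Hypothetical score: positives covered minus negatives minus clause length."""
    return p - n - c

def satisfies_restricted_monotonicity(score, limit=4):
    """Check Definition 1 for all counts (p, n, c) in range(limit)."""
    counts = range(limit)
    for p1, n1, c1, p2, n2, c2 in product(counts, repeat=6):
        dominated = p2 <= p1 and n1 <= n2 and c1 <= c2
        if dominated and score(p1, n1, c1) < score(p2, n2, c2):
            return False           # H2 scored higher although H1 dominates it
    return True

print(satisfies_restricted_monotonicity(f))                          # True
print(satisfies_restricted_monotonicity(lambda p, n, c: p - n + c))  # False
```

The second call fails because a score that rewards longer clauses prefers the dominated hypothesis.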
3.1 Application of Negative Cover Neutral Pruning

The negative cover neutral pruning rule in OPUS is given as follows.

Negative cover neutral pruning in OPUS [34]: For any node (i.e. candidate
hypothesis in the case of ILP) $n$ and an operator (i.e. literal) $o \in n.active$,
if $neg(n) = neg(n \wedge o)$ then prune all potential solutions (i.e. candidate
hypotheses) reached via application of $o$ from the search tree below $n$.

In the propositional learning framework, if $neg(n) = neg(n \wedge o)$, then $neg(n \wedge \alpha) =
neg(n \wedge o \wedge \alpha)$ holds, where $\alpha$ is some subset of $n.active \setminus \{o\}$. Furthermore,
$pos(n \wedge o \wedge \alpha) \subseteq pos(n \wedge \alpha)$ holds because $n \wedge \alpha$ is more general than or equal
to $n \wedge o \wedge \alpha$. Consequently, $f(n \wedge \alpha) \ge f(n \wedge o \wedge \alpha)$ holds.
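
The propositional argument rests on the fact that the cover of a conjunction is the intersection of the covers of its conjuncts. A minimal sketch, with a made-up set of negative examples and hypothetical helper names, checks the implication for every subset $\alpha$:

```python
from itertools import chain, combinations

def covered(conjunction, examples):
    """Indices of the examples that satisfy every condition of the conjunction."""
    return {i for i, e in enumerate(examples) if conjunction <= e}

def neutral_for_all_alpha(n, o, others, negatives):
    """Check neg(n & alpha) == neg(n & o & alpha) for every subset alpha of others."""
    alphas = chain.from_iterable(
        combinations(sorted(others), k) for k in range(len(others) + 1))
    return all(covered(n | set(a), negatives) == covered(n | {o} | set(a), negatives)
               for a in alphas)

negatives = [frozenset("ab"), frozenset("abc"), frozenset("bd")]
n, o = frozenset("a"), "b"

print(covered(n, negatives) == covered(n | {o}, negatives))   # True
print(neutral_for_all_alpha(n, o, {"c", "d"}, negatives))     # True
```

Here adding b to a excludes no negative example, and the same stays true after adding any further conditions.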

In the case of ILP, $neg(n \wedge \alpha) = neg(n \wedge o \wedge \alpha)$ does not hold in general, even if
$neg(n) = neg(n \wedge o)$. Therefore, in order to apply this pruning rule to Progol,
we need some conditions under which the above relation holds. A condition
under which negative cover neutral pruning can be applied to ILP is proposed
in [24]. Our condition can be regarded as a variant of that in [24].
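
A small first-order counter-example, constructed here for illustration and not taken from the paper, shows how the relation can fail when a negative example is derivable with more than one substitution. The coverage test below is a brute-force sketch; the encoding of literals and facts is an assumption.

```python
from itertools import product

def covers(body, example, facts, constants):
    """True if some substitution (with head variable A bound to the example)
    turns every body literal into a known fact."""
    variables = sorted({v for _, args in body for v in args if v != "A"})
    for values in product(constants, repeat=len(variables)):
        theta = dict(zip(variables, values), A=example)
        if all((pred, tuple(theta[v] for v in args)) in facts
               for pred, args in body):
            return True
    return False

# Background facts: q(e,1), q(e,2), r(1), s(2). The negative example is p(e).
facts = {("q", ("e", 1)), ("q", ("e", 2)), ("r", (1,)), ("s", (2,))}
constants = [1, 2]

H = [("q", ("A", "B"))]        # p(A) :- q(A,B).
L = ("r", ("B",))              # candidate literal r(B)
alpha = [("s", ("B",))]        # further literal s(B)

print(covers(H, "e", facts, constants))                # True
print(covers(H + [L], "e", facts, constants))          # True:  neg(H) = neg(H and L)
print(covers(H + alpha, "e", facts, constants))        # True
print(covers(H + [L] + alpha, "e", facts, constants))  # False: the relation breaks
```

The example p(e) is derived from H with two substitutions for B, and adding r(B) and s(B) selects different ones, which is the situation a uniqueness condition on substitutions rules out.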

Negative cover neutral pruning in Progol: Let $H$ be a candidate hypothesis,
$L \in H.active$ be a literal, $H'$ be the candidate hypothesis generated by
adding $L$ to $H$, $H'.active$ be $H.active \setminus \{L\}$, and $\alpha$ be a subset of $H'.active$.
If the evaluation function satisfies the restricted monotonicity, and $H$ and $L$
satisfy the following conditions, then $f(H \wedge \alpha) \ge f(H' \wedge \alpha)$ holds. Thus we
can prune all descendants of $H$ which contain $L$.

1. $neg(H) = neg(H')$

2. any negative example covered by $H$ is derived with a unique substitution
for variables in $Link(L, H) \cap V(H'.active)$

3. $V^-(L, H) \cap V(H'.active) = \emptyset$

Proof. From the first and second conditions, a negative example derived from
$H$ with a substitution for variables in $Link(L, H) \cap V(H'.active)$ also has to
be derived from $H'$ with only the same substitution for these variables. From
the third condition, $V(H) \cap V(\alpha) = V(H') \cap V(\alpha)$ holds. Therefore, a negative
example derived from $H' \wedge \alpha$ with a substitution is also derived from $H \wedge \alpha$ with
only the same substitution. Consequently, the equation $neg(H \wedge \alpha) = neg(H' \wedge \alpha)$
holds. Furthermore, the relations $pos(H \wedge \alpha) \supseteq pos(H' \wedge \alpha)$ and $|H \wedge \alpha| <
|H' \wedge \alpha|$ hold. Thus $f(H \wedge \alpha) \ge f(H' \wedge \alpha)$ holds from the definition of the
restricted monotonicity. □

Note that we consider only the substitution for the variables which connect
to $L$ and also appear in $H'.active$, because other variables have no effect on
the difference between the derivations of a negative example by $H \wedge \alpha$ and $H' \wedge \alpha$.
For the same reason, we allow any local variables in $L$ if they do not appear
in $H'.active$.

3.2 Application of Relative Cover Pruning

We give the relative cover pruning rule in OPUS as follows.

Relative cover pruning in OPUS [34]: For any node $n$ and an operator $o \in
n.active$, if there exists another operator $a \in n.active$ such that $neg(n \wedge a) \subseteq
neg(n \wedge o)$ and $pos(n \wedge o) \subseteq pos(n \wedge a)$ then prune all potential solutions
reached via application of $o$ from the search tree below $n$.

In the propositional framework, if $neg(n \wedge a) \subseteq neg(n \wedge o)$ holds, the following
two relations hold: (1) $neg(n \wedge a \wedge \alpha) \subseteq neg(n \wedge o \wedge \alpha)$, where $\alpha$ is a subset of
$n.active \setminus \{a, o\}$, and (2) $neg(n \wedge a) = neg(n \wedge a \wedge o)$. The analogous relations also
hold for positive examples. Here we divide all descendants of $n$ which contain $o$ into two
kinds: those which do not contain $a$ and those which do. From the first relation,
by replacing $o$ by $a$, we can build another hypothesis $n \wedge a \wedge \alpha$ which covers
more positive and fewer negative examples than $n \wedge o \wedge \alpha$ does. Then, because
$f(n \wedge a \wedge \alpha) \ge f(n \wedge o \wedge \alpha)$ holds, we can prune $n \wedge o \wedge \alpha$. From the second
relation, we can prune $n \wedge a \wedge o$ by applying negative cover neutral pruning.
Consequently, we can prune all descendants of $n$ which contain $o$.

However, in the case of ILP, the above relations do not hold in general. In
addition, we have to consider the variable chain when replacing $o$ by $a$.

Relative cover pruning in Progol: Let $H$ be a candidate hypothesis, and let $L_1 \in
H.active$ and $L_2 \in H.active \setminus \{L_1\}$ be literals. Let $H_1$ and $H_2$ be the candidate
hypotheses generated by adding $L_1$ and $L_2$ to $H$, respectively. Let $H_1.active$
be $H.active \setminus \{L_1\}$, $H_2.active$ be $H.active \setminus \{L_2\}$, and $\alpha \subseteq H.active \setminus \{L_1, L_2\}$ be a
set of literals. If the evaluation function satisfies the restricted monotonicity,
and the following conditions are satisfied, then the relations $f(H_1 \wedge \alpha) \ge
f(H_2 \wedge \alpha)$ and $f(H_1 \wedge \alpha) \ge f(H_1 \wedge L_2 \wedge \alpha)$ hold. Thus we can prune all
descendants of $H$ which contain $L_2$.

1. $neg(H_1) \subseteq neg(H_2)$ and $pos(H_2) \subseteq pos(H_1)$

2. any example covered by $H$ is derived with a unique substitution for
variables in $Link(L_1, H) \cup Link(L_2, H)$

3. $V^-(L_1, H) \cap V(H_1.active) = \emptyset$ and $V^-(L_2, H) \cap V(H_2.active) = \emptyset$

Proof. From the second and third conditions, any example covered by both $H_1$
and $H_2$ has to be covered either by both $H_1 \wedge \alpha$ and $H_2 \wedge \alpha$ or by neither of them.
Note that $V(H_1) \cap V(\alpha) = V(H_2) \cap V(\alpha)$ holds from the third condition. Also,
from the first condition, $neg(H_1 \wedge \alpha) \subseteq neg(H_2 \wedge \alpha)$ and $pos(H_2 \wedge \alpha) \subseteq pos(H_1 \wedge \alpha)$
hold. Furthermore, from the third condition, the variable chains in $H_2 \wedge \alpha$ do not break
even if we replace $L_2$ by $L_1$. Therefore, $f(H_1 \wedge \alpha) \ge f(H_2 \wedge \alpha)$
holds from the definition of restricted monotonicity.

Besides, any negative example covered by both $H_1$ and $H_2$ is also covered
by $H_1 \wedge L_2$ from the second and third conditions. Consequently, we can
apply negative cover neutral pruning because the following three conditions
hold: $neg(H_1) = neg(H_1 \wedge L_2)$; any negative example covered by $H_1$ is derived
with a unique substitution for variables in $Link(L_2, H_1)$ ($\cup\ Link(L_2, H) \cap
V(H_1.active \setminus \{L_2\})$); and $V^-(L_2, H_1) \cap V(H_1.active \setminus \{L_2\}) = \emptyset$. Therefore,
$f(H_1 \wedge \alpha) \ge f(H_1 \wedge L_2 \wedge \alpha)$ holds. □
3.3 Negative Cover Neutral Inclusion

Negative cover neutral pruning requires that the local variables in a literal $L$ which
is added to a hypothesis $H$ do not appear in $H.active \setminus \{L\}$, the set of literals
to be added for further refinement. If $L$ has some local variables which appear
in $H.active \setminus \{L\}$, we can divide the subsets of $H.active \setminus \{L\}$, referred to as
$\alpha$, into two kinds: those which do not contain any literal including local variables
of $L$, and those which contain such a literal. If $H$ and $L$ satisfy the conditions
of negative cover neutral pruning except the last condition, we can prune all
descendants of $H \wedge L$ which are generated by using the former kind of $\alpha$, because
$neg(H \wedge \alpha) = neg(H \wedge L \wedge \alpha)$ holds. We call this pruning negative cover neutral
inclusion.

We summarize this pruning formally as follows. 

Negative cover neutral inclusion in Progol: Let $H$ be a candidate hypothesis,
$L \in H.active$ be a literal, $H'$ be the candidate hypothesis generated by
adding $L$ to $H$, $H'.active$ be $H.active \setminus \{L\}$, and $\alpha$ be a subset of $H'.active$
such that $V^-(L, H) \cap V(\alpha) = \emptyset$. If the evaluation function satisfies the
restricted monotonicity, and $H$ and $L$ satisfy the following conditions, then
$f(H \wedge \alpha) \ge f(H' \wedge \alpha)$ holds. Thus we can prune all descendants of $H$ which
contain $L$ and $\alpha$.

1. $neg(H) = neg(H')$

2. any negative example covered by $H$ is derived with a unique substitution
for variables in $Link(L, H) \cap V(H'.active)$

Proof. The proof is directly derived from the proof of negative cover neutral
pruning. □
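
The division of the subsets $\alpha$ into the two kinds can be sketched as follows. The sketch only performs the syntactic split on local variables and assumes that conditions 1 and 2 have already been verified. The literal encoding and the function names are assumptions made for illustration.

```python
from itertools import chain, combinations

def variables(literals):
    """Variables (capitalised arguments) occurring in a collection of literals."""
    return {a for _, args in literals for a in args if a[:1].isupper()}

def split_alphas(L, H, active):
    """Split the subsets of active minus {L} by whether they mention a local variable of L."""
    local = variables([L]) - variables(H)
    rest = [lit for lit in active if lit != L]
    subsets = chain.from_iterable(
        combinations(rest, k) for k in range(len(rest) + 1))
    prunable, kept = [], []
    for alpha in subsets:
        (kept if variables(alpha) & local else prunable).append(alpha)
    return prunable, kept

a, b = ("a", ("A",)), ("b", ("A", "B"))
c, d = ("c", ("A",)), ("d", ("B",))
H = [("p", ("A",)), a]                  # p(A) :- a(A).
prunable, kept = split_alphas(b, H, [b, c, d])

print(prunable)   # subsets without the local variable B: (), (c(A),)
print(kept)       # subsets mentioning B: (d(B),), (c(A), d(B))
```

With L = b(A,B), the descendants built from the prunable subsets can be discarded, while those involving d(B) must still be searched.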

4 Experiments 

We implemented the proposed algorithms and developed a prototype system
based on Progol's framework in SICStus Prolog. The current implementation can
only take determinate background knowledge, and cannot handle variable-splitting.
As the evaluation function, we adopt that of Progol, which satisfies the
restricted monotonicity.

We used four data sets to examine the effectiveness of the proposed algorithms:
(1) illegal positions in the KRK endgame (referred to as KRKI) [TH],
(2) Email classification (Email), (3) respiration during musical performance
(Respiration), and (4) finite element mesh design (Mesh). All of these data
sets have only determinate background knowledge (we used the determinate version
of the background knowledge for Mesh). Email, Respiration and Mesh have multiple
classes. For Email and Respiration, one class was given as positive examples
and the remaining classes were given as negative examples. For Mesh, we learned
each class by using the corresponding data set provided in the original data.

In the experiments, we compared the number of generated candidate hypotheses
and the search time. We show the experimental results in Tables 1-4.
In each table, hyp and time denote the number of generated candidate hypotheses
and the search time (sec.), respectively. Fixed, Unordered, Negative, Relative,
Inclusion, and All denote fixed-order search, unordered search, negative cover
neutral pruning, relative cover pruning, negative cover neutral inclusion, and all
pruning, respectively. While Email has 84 classes, because of space limitations
we show the results for only the 20 classes which took the most computational time
in fixed-order search. KRKI and Email have no output variable, so negative
cover neutral inclusion was not applied. All of these except for Fixed adopt unordered
search, and we use the original pruning criterion in Progol, i.e. optimistic
pruning, in every case. Each number in parentheses in the Unordered column is
the ratio to Fixed, and the others are the ratio to Unordered.



Table 1. Experimental results of KRKI Data.

       Fixed    Unordered      Negative       Relative       All            All/Fixed
hyp    1092     851 (0.78)     210 (0.25)     210 (0.25)     210 (0.25)     0.19
time   95.3     72.9 (0.77)    13.2 (0.18)    13.2 (0.18)    13.2 (0.18)    0.14


Table 2. Experimental results of Respiration Data

class               Fixed   Unordered     Negative      Relative      Inclusion     All           All/Fixed
expiration   hyp    3022    2601 (0.86)   2546 (0.98)   2421 (0.93)   2158 (0.83)   1958 (0.75)   0.65
             time   19.8    17.6 (0.89)   17.7 (1.01)   16.3 (0.93)   14.9 (0.85)   14.0 (0.80)   0.71
inspiration  hyp    4771    4136 (0.87)   3923 (0.95)   3790 (0.92)   3322 (0.80)   3003 (0.73)   0.63
             time   32.2    28.2 (0.87)   27.7 (0.98)   26.0 (0.92)   23.2 (0.82)   21.9 (0.78)   0.68
no           hyp    2769    2297 (0.83)   2188 (0.95)   2097 (0.91)   1940 (0.84)   1759 (0.77)   0.64
             time   18.6    16.0 (0.86)   15.5 (0.97)   14.5 (0.91)   13.9 (0.87)   12.9 (0.81)   0.69



As a whole, unordered search reduces the number of generated candidate
hypotheses and the search time. For the KRKI data, both negative cover neutral
and relative cover pruning work very effectively. We believe that the results of
negative cover neutral and relative cover pruning coincide by accident.
Negative cover neutral inclusion works well on the Respiration data. Negative cover
neutral and relative cover pruning also improve the efficiency in most classes of the
Email data. The results on the Mesh data show the effectiveness of the combination
of pruning criteria.

There are some cases where Unordered took more time than Fixed, e.g. class4
in Mesh, and where Negative and Relative took more time than Unordered, e.g. class4
and class5 in Mesh, respectively. We think the reason is the overhead of checking
the conditions for pruning, as well as garbage collection. We also think that
the difference in speed-ups between domains, and between classes in the same
domain, comes from the characteristics of the data sets.
Table 3. Experimental results of Email Data

class        Fixed    Unordered       Negative       Relative      All           All/Fixed
c16   hyp    8682     5605 (0.65)     1408 (0.25)    414 (0.07)    386 (0.07)    0.04
      time   1604.5   1055.7 (0.66)   247.1 (0.23)   69.0 (0.07)   64.0 (0.06)   0.04
c56   hyp    1737     990 (0.57)      377 (0.38)     193 (0.19)    187 (0.19)    0.11
      time   312.9    180.5 (0.58)    64.9 (0.36)    31.9 (0.18)   30.8 (0.17)   0.10
c09   hyp    1611     1101 (0.68)     475 (0.43)     198 (0.18)    191 (0.17)    0.12
      time   293.7    204.5 (0.70)    84.5 (0.41)    33.4 (0.16)   32.1 (0.16)   0.11
c28   hyp    1557     657 (0.42)      466 (0.71)     299 (0.46)    299 (0.46)    0.19
      time   274.4    113.9 (0.41)    79.3 (0.70)    50.0 (0.44)   50.1 (0.44)   0.18
c17   hyp    1421     659 (0.46)      425 (0.64)     317 (0.48)    317 (0.48)    0.22
      time   247.3    114.4 (0.46)    72.2 (0.63)    53.2 (0.46)   53.2 (0.46)   0.22
c12   hyp    1188     458 (0.39)      327 (0.71)     169 (0.37)    162 (0.35)    0.14
      time   212.7    79.8 (0.38)     54.6 (0.68)    27.2 (0.34)   26.9 (0.34)   0.13
c53   hyp    1162     301 (0.26)      235 (0.78)     170 (0.56)    170 (0.56)    0.15
      time   208.9    53.3 (0.26)     40.9 (0.77)    29.2 (0.55)   29.2 (0.55)   0.14
c57   hyp    1119     420 (0.38)      316 (0.75)     113 (0.27)    113 (0.27)    0.10
      time   199.7    73.8 (0.37)     54.4 (0.74)    18.3 (0.25)   18.3 (0.25)   0.09
c05   hyp    652      169 (0.26)      106 (0.63)     62 (0.37)     61 (0.36)     0.09
      time   125.7    30.0 (0.24)     18.0 (0.60)    10.1 (0.34)   9.9 (0.33)    0.08
c22   hyp    612      347 (0.57)      281 (0.81)     176 (0.51)    175 (0.50)    0.29
      time   108.5    60.3 (0.56)     47.9 (0.79)    29.5 (0.49)   29.4 (0.49)   0.27
c04   hyp    595      198 (0.33)      167 (0.84)     71 (0.36)     71 (0.36)     0.12
      time   107.8    34.3 (0.32)     28.6 (0.83)    11.6 (0.34)   11.6 (0.34)   0.11
c52   hyp    591      248 (0.42)      214 (0.86)     132 (0.53)    131 (0.53)    0.22
      time   102.2    42.1 (0.41)     35.3 (0.84)    21.1 (0.50)   20.9 (0.50)   0.20
c36   hyp    583      204 (0.35)      181 (0.89)     109 (0.53)    109 (0.53)    0.19
      time   101.4    34.4 (0.34)     30.2 (0.88)    17.6 (0.51)   17.6 (0.51)   0.17
c59   hyp    539      102 (0.19)      101 (0.99)     75 (0.74)     75 (0.74)     0.14
      time   96.1     17.3 (0.18)     17.1 (0.99)    12.6 (0.73)   12.5 (0.72)   0.13
c06   hyp    513      651 (1.27)      620 (0.95)     111 (0.17)    111 (0.17)    0.22
      time   91.7     117.1 (1.28)    110.9 (0.95)   19.0 (0.16)   19.0 (0.16)   0.21
c20   hyp    513      149 (0.29)      149 (1.00)     149 (1.00)    149 (1.00)    0.29
      time   88.4     27.8 (0.31)     24.3 (0.87)    24.3 (0.88)   24.3 (0.88)   0.28
c85   hyp    460      172 (0.37)      157 (0.91)     102 (0.59)    102 (0.59)    0.22
      time   80.4     29.2 (0.36)     26.4 (0.90)    16.7 (0.57)   16.6 (0.57)   0.21
c10   hyp    446      238 (0.53)      224 (0.94)     112 (0.47)    112 (0.47)    0.25
      time   78.5     41.3 (0.53)     38.6 (0.94)    18.8 (0.46)   18.8 (0.46)   0.24
c43   hyp    444      342 (0.77)      214 (0.63)     122 (0.36)    119 (0.35)    0.27
      time   76.3     59.3 (0.78)     35.6 (0.60)    19.7 (0.33)   18.9 (0.32)   0.25
c02   hyp    412      115 (0.28)      101 (0.88)     69 (0.60)     69 (0.60)     0.17
      time   73.0     19.4 (0.27)     16.9 (0.87)    11.2 (0.58)   11.3 (0.58)   0.15
Table 4. Experimental results of Mesh Data

                 Fixed    Unordered        Negative         Relative         Inclusion        All              All/Fixed
class 1   hyp    222039   199751 (0.90)    188166 (0.94)    199501 (1.00)    177136 (0.89)    160944 (0.81)    0.72
          time   34254    31280 (0.91)     29020 (0.93)     32003 (1.02)     26145 (0.84)     22616 (0.72)     0.66
class 2   hyp    480007   475217 (0.99)    458922 (0.97)    474409 (1.00)    448665 (0.94)    405955 (0.85)    0.85
          time   79056    79885 (1.01)     76388 (0.96)     80483 (1.01)     69041 (0.86)     61087 (0.76)     0.77
class 3   hyp    314054   303754 (0.97)    295625 (0.97)    301913 (0.99)    279749 (0.92)    262783 (0.87)    0.84
          time   47823    47473 (0.99)     46514 (0.98)     48285 (1.02)     41329 (0.87)     38686 (0.81)     0.81
class 4   hyp    209686   208282 (0.99)    206628 (0.99)    208012 (1.00)    193901 (0.93)    190247 (0.91)    0.91
          time   35271    36003 (1.02)     36401 (1.01)     35790 (0.99)     31437 (0.87)     29551 (0.82)     0.84
class 5   hyp    110948   91929 (0.83)     90125 (0.98)     90242 (0.98)     87224 (0.95)     85180 (0.93)     0.77
          time   16393    14809 (0.90)     14723 (0.99)     15160 (1.02)     13048 (0.88)     12435 (0.84)     0.76
class 6   hyp    129172   122743 (0.95)    120453 (0.98)    122312 (1.00)    113786 (0.93)    111368 (0.91)    0.86
          time   21748    20724 (0.95)     20556 (0.99)     21238 (1.02)     18668 (0.90)     17859 (0.86)     0.82
class 7   hyp    95406    82003 (0.86)     75317 (0.92)     74795 (0.91)     59830 (0.73)     48556 (0.59)     0.51
          time   13932    10763 (0.77)     10208 (0.95)     10160 (0.94)     7350 (0.68)      6249 (0.58)      0.45
class 8   hyp    63624    51093 (0.80)     48774 (0.95)     48669 (0.95)     42877 (0.84)     37673 (0.74)     0.59
          time   10651    8558 (0.80)      8237 (0.96)      8133 (0.95)      6562 (0.77)      5749 (0.67)      0.54
class 9   hyp    69989    57928 (0.83)     55173 (0.95)     57322 (0.99)     52004 (0.90)     47348 (0.82)     0.68
          time   12506    10414 (0.83)     9843 (0.95)      10286 (0.99)     8534 (0.82)      7768 (0.75)      0.62
class 10  hyp    38806    27577 (0.71)     26640 (0.97)     27511 (1.00)     19868 (0.72)     17362 (0.63)     0.45
          time   6556     4176 (0.64)      4041 (0.97)      4200 (1.01)      2614 (0.63)      2351 (0.56)      0.36
class 11  hyp    56819    50539 (0.89)     50421 (1.00)     49587 (0.98)     45242 (0.90)     38606 (0.76)     0.68
          time   8015     7130 (0.89)      6962 (0.98)      7160 (1.00)      5863 (0.82)      5037 (0.71)      0.63
class 12  hyp    60527    49069 (0.81)     45788 (0.93)     48569 (0.99)     41313 (0.84)     33769 (0.69)     0.56
          time   8689     6905 (0.79)      6217 (0.90)      6903 (1.00)      5290 (0.77)      4335 (0.63)      0.50


However, as a whole, these experimental results show the usefulness of the 
proposed algorithms. 

5 Conclusion and Future Work 

In this paper, we proposed an application of OPUS to Progol. We pointed out 
the applicability of the unordered search and showed the conditions under which 
the pruning techniques in OPUS can be applied in the framework of Progol. 
In addition, we proposed a new pruning criterion called negative cover neutral
inclusion. Note that the pre-computed MSH plays a crucial role in defining these
conditions and pruning criteria.

Our future work includes implementing a system that can deal
with variable-splitting and non-determinate background knowledge. We think
the encoding techniques of hypothesis proposed in is useful for handling 
variable-splitting. In this encoding, each candidate hypothesis is represented by
an adjacency matrix. When we use this representation for the remaining
operators to be applied to the candidate hypothesis, we can handle
variable-splitting more directly. To handle non-determinate background
knowledge, we expect an additional cost for checking the uniqueness of the
substitution in the derivations of examples. However, we believe that
this cost could be reduced by using techniques from deductive databases [22],
such as bottom-up computation, OLDT and relational calculus. We are going to
examine the effectiveness of the proposed algorithms on many kinds of data sets,
including ones with non-determinate background knowledge, using the new
implementation.

Besides, we are planning to apply the proposed algorithms to descriptive ILP
systems such as WARMR[^, since OPUS has been applied to association rule 
mining recently [15].
Abstract. We propose to use ILP techniques to learn sets of temporally
constrained events called chronicles that a monitoring tool will use to 
detect pathological situations. ICL, a system providing a declarative bias 
language, was used for the experiments on learning cardiac arrhythmias. 
We show how to obtain properties, such as compactness, robustness or 
readability, by varying the learning bias. 



1 Introduction 

In medical domains such as cardiology, intensive care units make use of more
and more sophisticated monitoring tools. These tools have improved the
surveillance and care of patients suffering from severe disorders. However, many
false alarms are still generated and, from our point of view, these tools rely
too much on signal processing algorithms. There is a gap between the level of
understanding of clinicians and the information displayed by monitoring tools.
To be more informative and explanatory we think, as Lavrac et al. [S], that
monitoring tools must manipulate more abstract knowledge, such as temporal
relations between interesting events reflecting the patient's state. We have
proposed in [2] to associate signal processing techniques with high-level
temporal reasoning for patient monitoring. The first module processes input
signals and outputs symbolic attributed events that feed a chronicle recognizer,
which attempts to detect specific patterns among these events. Chronicles are
event patterns which impose temporal constraints among a set of events.

As devising chronicles is not, in general, an easy task, we propose to use
machine learning techniques in order to obtain accurate and interesting
characterizations of pathological situations from examples of input signals
related to disorders that may affect some patient. In the domain of coronary
care units, the signals are multi-channel electrocardiograms (ECGs) and the
situations to recognize are cardiac arrhythmias. Because temporal relations
among events are crucial, as is a specification language which can lead to
informative explanations, we have chosen to use inductive logic programming
(ILP). This is a major difference between Kardio [1] and our own approach.
Kardio uses feature-based induction; thus, it can only learn predefined
propositional structural relations. In ILP, target concepts are represented as
first-order formulas. This makes the rules more abstract and easier to
understand, which is an essential point in our context as tools have to explain
their results to users. Kokai et al. [B] proposed to learn attributed grammars
for arrhythmia recognition in ECG from elementary curve segments. Their approach
relies on grammar refinement which, they say, is not well suited to learning
constraints in rules. Also, the learnt grammars specify only one cardiac cycle,
which is too short to describe recurrent phenomena such as cardiac arrhythmias.

Fig. 1. A normal ECG (on the left) and a bigeminy ECG (on the right).

In this paper, our goal is to demonstrate that ILP is a powerful and smart
technique that makes it relatively easy to learn knowledge adapted to the
problem at hand. Precisely, we show how to play with bias specifications in
order to learn concept definitions enjoying such different properties as
robustness, readability or recognition efficiency. DLAB, the declarative bias
language of ICL [12], has proved quite useful and flexible for achieving this
goal. The first section gives some basic knowledge about cardiac arrhythmias.
The next section presents the data and learning materials. Next, we describe the
results obtained on learning five arrhythmias. Finally, we conclude and give
some perspectives on this work.



2 Electrocardiograms 

The electrocardiogram provides very important cues for cardiac analysis and
diagnosis. First of all, ECGs can be recorded easily with non-invasive leads
that are put at particular locations of the body surface. Second, ECGs can be
inspected visually by physicians in order to analyze the ordering and the shape
of particular waves which can be related directly to the patient's heart
activity. The most important waves are the P wave and the QRS complex, which are
related respectively to the depolarization of the atria and the depolarization
of the ventricles. The ECG presents series of such waves which are organized in
cardiac cycles, each representing a complete heart contraction and an electrical
potential recovery. The normal cycle is a succession of: P wave - QRS complex -
T wave. The temporal intervals between these waves are commonly used for
diagnosis and noted PR, QT and RR (see Figure 1, left part).

Cardiac arrhythmias are disorders of rates, rhythms and conduction originating
in heart areas with dysfunctions. Arrhythmias can be recognized by specific
arrangements of ECG waves satisfying temporal constraints. For example, Figure 1
presents on the left a normal ECG where all heart elements (seem to) work
fine. The ECG on the right is related to an arrhythmia called bigeminy, where
one can note the presence of extra ventricular beats due to an ectopic focus
which acts as an extra pacemaker. Bigeminy is classically defined by the wave
sequence P - QRS - QRS’ - P - QRS - QRS’, where QRS’ denotes a QRS having
an abnormal shape, together with the temporal constraints normal PR, short
RR’ and long R’R, where R’ denotes the abnormal QRS. This is the kind of
temporal patterns that chronicle recognition algorithms m are able to detect. 

Clearly, the definition of bigeminy above is best represented by a first-order 
formula as it contains true relations between events. In fact, the following Prolog 
clause gives a straightforward specification of this definition: 

bigeminy :- qrs(R0, normal, P0, _), qrs(R1, abnormal, P1, R0), rr1(R0, R1, short).   (1)

It states that, in bigeminy, the temporal interval between a normal and an
abnormal QRS is short. To learn specifications like formula (1) we need methods
that can induce temporal constraints such as simple or delayed precedence
between events. Inductive logic programming (ILP) aims at inducing first-order
representations of target concepts and is quite adapted to this task uni. 

3 Learning Algorithms and Materials 

In this section, we first recall some principles of ILP. Then we describe the 
learning data that were used to learn cardiac arrhythmias. Finally, we show how 
to formulate a bias in order to improve the learning efficiency. 

ICL: an Inductive Logic Programming system 

The aim of ICL is to find a first-order theory H ⊆ L_h that is complete (it
covers all the given positive examples) and consistent (it covers no negative
examples). L_h is the hypothesis language and is generally a subset of
first-order logic. An interesting feature of ILP systems is that they provide
users with declarative tools for specifying L_h. ICL [12] offers a high-level
concept specification language called DLAB in which the syntax of the hypothesis
language can be defined. DLAB grammars are preprocessed in order to generate
candidate hypotheses from the most general to the most specific ones (under
θ-subsumption).

ICL also enables multi-class learning [7]. The idea behind multi-class learning
is simple: when learning one particular class, consider as positive only those
examples belonging to this class and as negative all the examples belonging to
the remaining classes. This is an attractive option in our case as we want to
discover definitions which discriminate among several (> 2) arrhythmias.
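
The one-class-against-the-rest labelling described above can be illustrated with a minimal Python sketch. It is not ICL code: the function name and the toy example identifiers are hypothetical and only show how positives and negatives are derived for one target class.

```python
def one_vs_rest(examples, target):
    """Split (example_id, class_label) pairs into positives and negatives
    for a single target class, as in multi-class learning."""
    positives = [ex for ex, label in examples if label == target]
    negatives = [ex for ex, label in examples if label != target]
    return positives, negatives


# Hypothetical toy data: the identifiers e1..e6 are illustrative only.
examples = [
    ("e1", "bigeminy"), ("e2", "lbbb"), ("e3", "mobitz2"),
    ("e4", "normal"), ("e5", "pvc"), ("e6", "bigeminy"),
]
pos, neg = one_vs_rest(examples, "bigeminy")
```

Repeating the split with each class as the target yields one learning problem per arrhythmia.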

Data 

In order to assess the versatility of ICL and DLAB, we have selected a subset
of arrhythmias related to different cardiac disorders involving various parts of
the heart: the atrio-ventricular (AV) node for the Mobitz type II arrhythmia
(class mobitz2), the left bundle branch for the left bundle branch block (class
lbbb) and the ventricle for bigeminy. ECGs related to a normal heart activity
were also added (class normal). These 4 classes are not so difficult to separate.
To increase the difficulty, we have added one class: the premature ventricular
contraction arrhythmia (PVC), which is characterized by sparse extra
contractions due to an ectopic focus. The presence of ectopic beats makes this
class close to bigeminy. The fact that ectopic beats are sparse makes this class
close to the normal class, as large portions of PVC ECGs are normal.

begin(model(bigeminy_119_1)).
bigeminy.
wave(p1, p, 651, normal, null).
wave(r1, qrs, 836, normal, p1).
wave(r2, qrs, 1357, abnormal, r1).
wave(p2, p, 2528, normal, r2).
wave(r3, qrs, 2686, normal, p2).
wave(r4, qrs, 3203, abnormal, r3).
wave(p3, p, 4428, normal, r4).
wave(r5, qrs, 4577, normal, p3).
wave(r6, qrs, 5086, abnormal, r5).
wave(p4, p, 6279, normal, r6).
end(model(bigeminy_119_1)).

Fig. 2. A bigeminy arrhythmia ECG and its related specification as an ICL example

Real recorded ECG examples taken from the MIT BIH database |H] were 
used. 20 ECGs lasting 10s each were associated to each class. Every ECG is 
preprocessed by a signal processing algorithm and transformed into a symbolic 
representation based on P and QRS events [Q. This is the same module that is 
used on-line to produce symbolic events that will be processed by the chronicle 
recognizer. It aims: i) at detecting and identifying the markers of cardiac
activity (P waves and QRS complexes), ii) at characterizing each wave by a
feature vector, and iii) at classifying waves as normal or abnormal. This
module is not further detailed here (see [2]) but it is of major importance as
the performance of the “symbolic part” of the system relies on good input data.

Symbolic electrocardiograms

Figure 2 presents an ECG example coded as a set of Prolog clauses. To each
event is associated its type, its occurrence time in the ECG and a qualification
(normal or abnormal) of the related wave shape. This information is coded by
the predicate wave(Event, Type, Time, Qual, Pre_event) which states that
Event is related to a wave of type Type (p or qrs), which occurred at time
Time, the shape of which is Qual (normal or abnormal) and Pre_event just
precedes Event on the ECG. We chose to code the structural information (order
of events) as a 5th argument of the predicate wave. We could have used an
additional relational predicate as well.
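
To make the encoding concrete, the sketch below copies the events of Fig. 2 into Python tuples, derives the intervals between consecutive QRS complexes, and checks the pattern of formula (1): a normal QRS followed by an abnormal one after a short RR interval. The threshold `SHORT_RR` is a hypothetical value chosen for illustration; the text does not state how short, normal and long intervals are delimited, and the function names are assumptions as well.

```python
# Events of Fig. 2 as (event, type, time, quality, previous_event) tuples.
WAVES = [
    ("p1", "p", 651, "normal", None),
    ("r1", "qrs", 836, "normal", "p1"),
    ("r2", "qrs", 1357, "abnormal", "r1"),
    ("p2", "p", 2528, "normal", "r2"),
    ("r3", "qrs", 2686, "normal", "p2"),
    ("r4", "qrs", 3203, "abnormal", "r3"),
    ("p3", "p", 4428, "normal", "r4"),
    ("r5", "qrs", 4577, "normal", "p3"),
    ("r6", "qrs", 5086, "abnormal", "r5"),
    ("p4", "p", 6279, "normal", "r6"),
]

# Hypothetical threshold (same time units as Fig. 2), not taken from the paper.
SHORT_RR = 700


def qrs_pairs(waves):
    """Return pairs of consecutive QRS events, ignoring P waves."""
    qrs = [w for w in waves if w[1] == "qrs"]
    return list(zip(qrs, qrs[1:]))


def rr_intervals(waves):
    """Time elapsed between each QRS and the next one."""
    return [cur[2] - prev[2] for prev, cur in qrs_pairs(waves)]


def looks_like_bigeminy(waves, short_rr=SHORT_RR):
    """True if a normal QRS is followed by an abnormal QRS within short_rr."""
    return any(
        prev[3] == "normal" and cur[3] == "abnormal" and cur[2] - prev[2] < short_rr
        for prev, cur in qrs_pairs(waves)
    )
```

On this example the intervals alternate between short (normal to abnormal) and long (abnormal to normal) values, which is the bigeminy signature described in Section 2.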

Background knowledge 

The aim of background knowledge is to ease learning by bringing in knowledge of
the domain from which the data come, as well as search knowledge which
will be used to prune the clause space. In m , the concept of declarative learning 
bias is studied and its importance and properties are clearly demonstrated. 
1-1:[
  len-len:[p_wave(P1, 1-1:[normal, abnormal], R0),
           qrs(R1, 1-1:[normal, abnormal], P1),
           0-len:[rr1(R0, R1, 1-1:[short, normal, long]),
                  pr1(P1, R1, 1-1:[short, normal, long])]],
  len-len:[p_wave(P1, 1-1:[normal, abnormal], R0),
           pp1(P0, P1, 1-1:[short, normal, long])],
  len-len:[qrs(R1, 1-1:[normal, abnormal], R0),
           0-1:[rr1(R0, R1, 1-1:[short, normal, long])]]
].

Fig. 3. Syntactic specification of a cardiac cycle in DLAB

ICL [12] comes with DLAB, a declarative language for bias specification.

A DLAB grammar consists of rule templates that fix the syntactic form of
clauses defining the target concept. These templates have the form Head <-
Body where Head and Body are DLAB terms. A term is either an atomic formula
or a set specification having the form l-h:[el1, el2, ..., eln]. Such an
expression means: choose from l to h elements from the set [el1, el2, ..., eln].
The special symbol len can be used to specify the total length of the list.
These expressions are used as combinatorial generators that can produce all
the possible instances satisfying the templates. For example, the DLAB term
p(2-len:[el1,el2,el3]) generates the following expressions:
p(el1,el2), p(el1,el3), p(el2,el3), p(el1,el2,el3).
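
The combinatorial reading of a set specification can be sketched in a few lines of Python. This is only an illustration of the "choose between a minimum and a maximum number of elements" semantics for a flat list; it is not the DLAB preprocessor, it does not handle nested terms, and the function name `expand` is hypothetical.

```python
from itertools import combinations


def expand(low, high, elements):
    """Enumerate every selection of between `low` and `high` elements,
    keeping the order of the list. The string "len" stands for the
    total length of the list, as in DLAB."""
    if low == "len":
        low = len(elements)
    if high == "len":
        high = len(elements)
    return [list(c) for k in range(low, high + 1) for c in combinations(elements, k)]


# The example from the text: choose 2 to len elements among three.
instances = expand(2, "len", ["el1", "el2", "el3"])
```

The number of instances grows quickly with the size of the list and with nesting, which is one reason why a restrictive bias keeps learning times manageable.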

Figure 3 shows how the specification of a cardiac cycle may be formulated in
DLAB. It says that a cardiac cycle is composed of exactly one (range 1-1, line
1) of the following configurations:

- a P-wave followed by a QRS complex followed by optional (range 0-len)
temporal constraints (pr1 and rr1 in lines 2-5). For instance, the following
expression satisfies this DLAB specification:

p_wave(P1, normal, R0), qrs(R1, abnormal, P1), pr1(P1, R1, long),

- a P-wave alone; in this case the temporal constraint between this wave and
the preceding one is mandatory (lines 6 and 7),

- a QRS complex alone; in this case the temporal constraint between this wave
and the preceding one is optional (lines 8 and 9).

Finally, a rule body is a sequence of such DLAB expressions telling ICL that 
an arrhythmia is defined by one or several cardiac cycles. Such a specification 
may appear quite sophisticated and restrictive. We have tried more permissive 
biases but either they led to prohibitive learning times or the quality of induced 
rules was very poor. Our objective has been to induce clauses that could be 
tailored in order to take into account such notions as readability, efficiency or 
robustness. Basing the induction on the notion of cardiac cycle enables readabil- 
ity since this is a concept that is commonly used by specialists for arrhythmia 
description or for diagnosis. 
class(bigeminy) %[13, 0, 0, 0, 0], [5, 19, 18, 18, 17]
    qrs(R0, abnormal, _), p_wave(P1, normal, R0), qrs(R1, normal, P1),
    qrs(R2, abnormal, R1), rr1(R1, R2, short).
class(bigeminy) %[5, 0, 0, 0, 0], [13, 19, 18, 18, 17]
    qrs(R0, normal, _), p_wave(P1, normal, R0), qrs(R1, abnormal, P1).
class(lbbb) %[0, 19, 0, 0, 0], [18, 0, 18, 18, 17]
    qrs(R0, abnormal, _), p_wave(P1, normal, R0), qrs(R1, abnormal, P1).
class(mobitz2) %[0, 0, 16, 0, 0], [18, 19, 2, 18, 17]
    p_wave(P0, normal, _), equal(P0, R0),
    p_wave(P1, normal, R0), qrs(R1, normal, P1).
class(mobitz2) %[0, 0, 2, 0, 0], [18, 19, 16, 18, 17]
    p_wave(P0, normal, _), equal(P0, R0),
    p_wave(P1, normal, R0), qrs(R1, abnormal, P1).
class(normal) %[0, 0, 0, 17, 4], [18, 19, 18, 1, 13]
    p_wave(P0, normal, _), qrs(R0, normal, P0),
    p_wave(P1, normal, R0), qrs(R1, normal, P1),
    p_wave(P2, normal, R1), qrs(R2, normal, P2),
    p_wave(P3, normal, R2), qrs(R3, normal, P3), p_wave(P4, normal, R3).
class(pvc) %[0, 0, 0, 0, 17], [18, 19, 18, 18, 0]
    p_wave(P0, normal, _), qrs(R0, normal, P0),
    p_wave(P1, normal, R0), qrs(R1, normal, P1),
    qrs(R2, abnormal, R1), rr1(R1, R2, short).

Fig. 4. Rules induced for a learning experiment on 5 classes



4 Results 

The first goal of the experiments was to test whether understandable and use- 
ful arrhythmia specifications could be learnt from temporal data coming from 
example ECGs. A second goal was to assess the flexibility of using a declara- 
tive bias for imposing desirable properties such as readability or robustness on 
induced concepts. For instance, inducing the shortest clauses can be achieved
by imposing only one cardiac cycle. This should make recognition more efficient,
as such rules specify fewer events to be recognized. Inducing longer rules
enhances readability, since the regularity of a phenomenon may be easier to
assess. A bias imposing several cycles, e.g. three or four, would be used for
this purpose.

Inducing rules for five arrhythmias

Figure 4 displays the rules obtained from ICL when imposing one mandatory
cardiac cycle and four optional ones. Those rules produce the shortest chronicles
which are expected to enable early detection. To each rule is associated the
number of examples covered by this rule in each class (respectively bigeminy,
lbbb, mobitz2, normal and pvc) and the number of examples covered by its
negation. For example, the list [13,0,0,0,0] associated to the first rule for
bigeminy in Figure 4 means that this rule covers 13 positive examples from class
bigeminy, and none from the classes lbbb, mobitz2, normal and pvc.

Though only one cycle was mandatory, every rule states constraints on at
least two cycles. Two types of temporal constraints are used: sequential
constraints between events, by means of the third argument of the p_wave and qrs
predicate literals, and temporal constraints on intervals, by means of the
predicates pr1 and rr1, which appear to be the most used by specialists. Two
rules were necessary for mobitz2. This arrhythmia can be characterized by the
episodic absence of a ventricular contraction. It is sometimes accompanied by a
right bundle branch block (rbbb) provoking an enlarged QRS. This was the case for
some of the examples of this class. The two rules that were obtained reflect this
fact: in the first one the QRS are normal whereas in the second one the QRS are
abnormal and then denote a joint rbbb.

Table 1. Learning 5 classes: statistics of 10-fold cross-validation

Set | Acc   | TrueTot | FalseTot | TrAcc | Correct     * Incorrect   /class #
  1 | 1.000 |   10    |    0     | 0.989 | [1,1,5,1,2] * [0,0,0,0,0] #
  2 | 1.000 |   10    |    0     | 0.989 | [1,2,2,2,3] * [0,0,0,0,0] #
  3 | 1.000 |   10    |    0     | 1.000 | [3,2,2,1,2] * [0,0,0,0,0] #
  4 | 1.000 |   10    |    0     | 1.000 | [2,4,2,0,2] * [0,0,0,0,0] #
  5 | 1.000 |   10    |    0     | 0.989 | [2,1,2,2,3] * [0,0,0,0,0] #
  6 | 1.000 |   10    |    0     | 0.989 | [3,0,2,3,2] * [0,0,0,0,0] #
  7 | 0.900 |    9    |    1     | 1.000 | [1,2,0,5,1] * [0,0,0,1,0] #
  8 | 1.000 |   10    |    0     | 0.989 | [2,1,3,2,2] * [0,0,0,0,0] #
  9 | 1.000 |   10    |    0     | 0.989 | [2,4,0,1,3] * [0,0,0,0,0] #
 10 | 1.000 |   10    |    0     | 0.989 | [3,3,2,2,0] * [0,0,0,0,0] #
Tot:  9.900 |   99    |    1
Accuracy: 0.990 (+/-0.030)   (Training set Accuracy: 0.992 (+/-0.005))

Validation 

Table 1 gives the statistics obtained after a 10-fold cross-validation on learning
5 classes. 10% of the examples were left out for test in each round. The column
TrAcc gives the training accuracy and the column Acc gives the test accuracy
for each round. 99.2% and 99% global accuracy was obtained for training and
test respectively. These results are very good and show that accurate definitions
may be induced from complex data.
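
The global figures can be recomputed from the per-fold values of Table 1, as in the short Python sketch below. The use of the population standard deviation is an assumption: it happens to reproduce the reported +/- values, but the text does not say which estimator was used.

```python
from statistics import mean, pstdev

# Per-fold accuracies transcribed from Table 1 (folds 1 to 10).
test_acc = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 1.0]
train_acc = [0.989, 0.989, 1.0, 1.0, 0.989, 0.989, 1.0, 0.989, 0.989, 0.989]


def summarize(accuracies):
    """Return (mean, population standard deviation), rounded to 3 decimals."""
    return round(mean(accuracies), 3), round(pstdev(accuracies), 3)


test_summary = summarize(test_acc)    # compare with Accuracy: 0.990 (+/-0.030)
train_summary = summarize(train_acc)  # compare with Training set Accuracy: 0.992 (+/-0.005)
```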

The rules learnt in the previous experiments were also assessed by specialists 
from a qualitative point of view. Though sometimes they were surprised by some 
definitions which did not correspond to the general definition they were used to, 
they rated all the rules as being correct and relevant. 



5 Conclusion 

This paper has presented an application of ILP techniques to the acquisition 
of a set of high-level temporal patterns (or chronicles) characterizing cardiac 
arrhythmias. The main novelty in this application is the fact that we are dealing 
with temporal and structured data. The ultimate goal is to get a chronicle base 
which is used by a chronicle recognition tool to analyse, in an on-line monitoring 
context, an ECG signal and detect cardiac disorders. A description of the whole
project can be found in j^. A set of real recorded ECG signals, taken from the 
MIT database, has been preprocessed by a signal processing algorithm into a 
symbolic representation and constitute the training base. 

We focus in this paper on the experiments we did with ICL [12] and we
demonstrate the interest of using a declarative bias language such as DLAB.
According to the properties that are looked for, such as readability or
robustness, different biases were tried and resulted in different sets of rules.

Two main issues are currently investigated: the first one is to cope with 
multiple sources of information (multichannels and multisensors) . This means a 
new learning phase in order to get a set of chronicles able to take into account 
not only the temporal aspect of each signal but also the relationships existing 
among these different signals. The second issue concerns active cardiac devices 
which rely on leads located in both ventricles. These new devices can tackle both 
rhythmic and hemodynamic disorders but the signatures are still poorly known. 
We are currently experimenting with our learning module on these data in order
to exhibit such signatures.
Abstract. Cross-validation is a technique used in many different machine
learning approaches. Straightforward implementation of this technique
has the disadvantage of causing computational overhead. However, it has
been shown that this overhead often consists of redundant computations,
which can be avoided by performing all folds of the cross-validation in
parallel. In this paper we study to what extent such a parallel algorithm
is also useful in ILP. We discuss two issues: a) the existence of
dependencies between parts of a query that limit the obtainable efficiency
improvements and b) the combination of parallel cross-validation with
query-packs. Tentative solutions are proposed and evaluated experimentally.



1 Introduction 

Cross-validation is a technique used in many different machine learning ap- 
proaches, such as instance based learning, artificial neural networks or decision 
tree induction, to tune parameters, select relevant features or to estimate predic- 
tive accuracies. Running an n-fold cross-validation consists of partitioning the 
data set D into n subsets Di and then running the given learning algorithm 
n times, each time using a different training set Ti = D — Di and a different 
validation set Di. 
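
The partitioning scheme just described can be written as a small Python sketch. The function name and the round-robin assignment of examples to subsets are illustrative choices, not part of any particular ILP system.

```python
def cross_validation_folds(data, n):
    """Partition `data` into n disjoint validation sets D_i and pair each
    with its training set T_i, which contains all the other examples."""
    parts = [data[i::n] for i in range(n)]  # round-robin partition of D
    folds = []
    for i, validation in enumerate(parts):
        training = [x for j, part in enumerate(parts) if j != i for x in part]
        folds.append((training, validation))
    return folds


data = list(range(10))
folds = cross_validation_folds(data, 5)
```

Each example appears in exactly one validation set and in all the other training sets, which is why the training sets overlap so heavily.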

An obvious disadvantage of performing cross-validation is the computational 
overhead of running the learning algorithm n times. This is a problem for ILP 
systems which are known to have high execution times. However, for some sys- 
tems, this overhead can be reduced significantly. The training sets Ti are highly 
similar. Indeed, each example from the original data set D occurs n — 1 times 
as training example. Because of this similarity, redundant computations are per- 
formed in the different cross-validation runs. These redundancies can be removed 
by integrating the different runs into one single algorithm. This is shown in [1]
for decision tree induction. 

Similarities between the training sets are not the only kind of similarities that 
can cause an ILP system to do redundant computations. ILP systems typically 
search a large hypothesis space. This involves testing a huge number of first 
order logic queries on the training set. Most ILP algorithms search through the 
hypothesis space in a greedy manner. First order decision tree learners (Tilde 
E, S-Cart 0) for example consider refinements of the query from the previous 
level of the tree when selecting a query for a new node. A refinement of a query is
obtained by extending it with new literals. This means that different refinements
of the same query are highly similar (share literals). One can imagine that there
will be redundant computations when testing these similar queries separately on
the training set. It is shown in [2] that this kind of redundancy can be removed
by integrating the similar queries in one so-called query-pack.

A first goal of this text is to discuss efficient cross-validation from an ILP 
point of view. We use decision tree induction to explain the concepts, but the 
method for efficient cross-validation can also be integrated in rule induction 
systems like FOIL m or Progol [9]. We discuss an important problem related to
the fact that a query in a given node of a first order decision tree depends on the 
queries of higher level nodes. This query-dependency problem also occurs to some 
extent for rule induction. We show how the parallel cross-validation algorithm 
from [3] can be adapted to reduce the overhead caused by this problem. A second
goal of this text is to investigate how the query-packs from [2] can be integrated
in the parallel cross-validation algorithm. 

This paper is organised as follows. Section 2 summarises logical decision tree 
induction, efficient decision tree cross-validation and query-packs. Section 3 dis- 
cusses the query-dependency problem, shows how query-packs can be integrated 
in the parallel algorithm and suggests how the parallel algorithm can be modified 
for rule induction. Section 4 presents experimental results. We investigate the 
possible efficiency gain of combining parallel cross-validation with query-packs 
and the effect of the query-dependency problem. Section 5 states the conclusions. 

2 Preliminaries 

2.1 Logical Decision Tree Induction 




Fig. 1. A first order decision tree. 



A first order decision tree [T] is a binary decision tree with conjunctions 
of first order literals in the nodes. The leaves contain class values in case of a 
classification task or (vectors of) real values in case of a regression task. An 
example tree grown on one of the Bongard data sets [B] is shown in Fig. 1. The
prediction task for this set is classifying pictures containing circles, squares and
triangles as positive or negative. We use the learning from interpretations setting
1^ in which each example is given by a set of (Prolog) facts. Notice that it is 
not necessary to include a key variable in this setting. 

First order decision trees are grown top down. The induction algorithm starts 
with the trivial query true. It then continues to add new nodes to the tree 
until a stop criterion is satisfied. The literal for a new node is selected by a 
greedy algorithm. It first generates refinements of the current query by extend- 
ing it with new literals. In the Bongard example possible refinements of the 
query circle(C) are (circle(C), triangle(D)), (circle(C), square(D)),
(circle(C), in(C,D)), ... The algorithm computes a quality measure such as
information gain PI for each refinement. The refinement that maximises this 
quality is used to create the new node. 
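The greedy selection step can be illustrated with a minimal sketch. It is a propositional simplification and not the Tilde implementation: each example is reduced to a set of shape names plus a class label, each refinement is a boolean test on that set, and the names `entropy`, `information_gain` and `best_refinement` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(examples, test):
    """Gain obtained by splitting (features, label) pairs with a boolean test."""
    labels = [label for _, label in examples]
    yes = [label for feats, label in examples if test(feats)]
    no = [label for feats, label in examples if not test(feats)]
    n = len(labels)
    remainder = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy(labels) - remainder


def best_refinement(examples, refinements):
    """Return (name, gain) of the refinement with maximal information gain."""
    scored = [(information_gain(examples, test), name)
              for name, test in refinements.items()]
    gain, name = max(scored)
    return name, gain


# Toy data: every picture contains a circle; positives also contain a triangle.
examples = [
    ({"circle", "triangle"}, "pos"),
    ({"circle", "triangle"}, "pos"),
    ({"circle", "square"}, "neg"),
    ({"circle"}, "neg"),
]
refinements = {
    "circle(C), triangle(D)": lambda feats: "triangle" in feats,
    "circle(C), square(D)": lambda feats: "square" in feats,
}
name, gain = best_refinement(examples, refinements)
```

In a first order learner the test would be a Prolog query run against the facts of each example; only the selection logic is shown here.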



2.2 The Parallel Algorithm 



 1. (q*, Q*)_{0..n} = (none, -∞)
 2. for each refinement q
 3.     S^D_{1..n} = 0
 4.     for each i ∈ 1...n
 5.         for each e ∈ D_i
 6.             update_statistics(S^D_i, q(e), e)
 7.     S^T_0 = Σ_i S^D_i
 8.     S^T_i = S^T_0 - S^D_i   (i = 1...n)
 9.     for each i ∈ 0...n
10.         Q = compute_quality(S^T_i)
11.         if Q > Q*_i then (q*, Q*)_i = (q, Q)
12. for each different q* ∈ {q*_i}
13.     partition D_i according to q*



Fig. 2. Parallel cross-validation. 



As explained briefly in the introduction, decision tree cross-validation involves
growing n different trees, each on a slightly different training set T_i =
D - D_i. Because the training sets T_i are highly similar, one can expect that the
trees will be highly similar too, especially near the root. The parallel algorithm
[1] shown in Fig. 2 exploits this similarity while growing n + 1 different trees at
once: one tree for each cross-validation fold and one tree grown on the whole
data set D = T_0. We call this set of n + 1 trees the cross-validation forest. The
algorithm keeps track of a tuple (q*, Q*)_i (Line 1) for each tree of the forest.
The first component of this tuple, q*, is the best query found so far and the
second component, Q*, is the quality of q*. In each iteration another refinement q is
evaluated on the data (Lines 3-10). Line 11 updates (q*, Q*)_i if q is better than
q*_i.

Lines 3-6 compute the statistics S^D_i for query q on each set D_i of the cross-
validation partition. Because these sets are disjoint, the query is evaluated only
once on each example from D. To estimate the quality Q (Line 10) we need the
statistics S^T_i on the training sets T_i, which are derived from the previously
calculated S^D_i (Lines 7-8). This step is only possible if the statistics are additive,
meaning that S(A_1 ∪ A_2) = S(A_1) + S(A_2) if A_1 ∩ A_2 = ∅. Each statistic S is a tuple
with two components PS and NS. The positive component PS is updated if
q(e) succeeds (Line 6) and the negative component NS is updated if q(e) fails.
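The role of additivity can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' code: a statistic is reduced to a pair of counts (PS, NS), a query is a Python predicate, and the function names `fold_statistics` and `training_statistics` are hypothetical.

```python
def fold_statistics(folds, query):
    """Evaluate the query once per example and return (PS, NS) for each fold D_i."""
    stats = []
    for fold in folds:
        ps = sum(1 for example in fold if query(example))
        stats.append((ps, len(fold) - ps))
    return stats


def training_statistics(fold_stats):
    """Derive the statistics on T_0 = D and on each T_i = D - D_i by addition and subtraction."""
    total = (sum(ps for ps, _ in fold_stats), sum(ns for _, ns in fold_stats))
    per_fold = [(total[0] - ps, total[1] - ns) for ps, ns in fold_stats]
    return [total] + per_fold


# Three folds of integer "examples"; the query succeeds on even numbers.
folds = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
per_fold = fold_statistics(folds, lambda e: e % 2 == 0)
per_tree = training_statistics(per_fold)
```

Each example is touched once, yet statistics for all n + 1 training sets become available; with a non-additive statistic the subtraction step would not be valid.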







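[Figure: the root node tests circle(C)? and has a yes and a no branch. On the next level, folds {0,1,3} select in(C,D)? and fold {2} selects triangle(D)?; one leaf is labelled [neg].]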





Fig. 3. A 3-fold cross-validation forest. 



The last two lines of the algorithm update the nodes of the forest. A forest
is not really a set of disjoint trees: there is some amount of sharing between the
trees. The trees continue to share nodes as long as the same query q* is selected
for each tree. A group of trees or a single tree can be split off from the forest if
the algorithm selects a different q* for this group. An example forest is shown
in Fig. 3. At level two the tree for training set T_2 is split off from the forest and
stops sharing computations.

It is shown in [1] that the speedup factor T_serial/T_parallel is given by the
expression below, where t_r(l) is the average time for growing level l of a single
tree from D and f(l) is the average number of tree-groups that have been split off
in the forest:

$$\frac{T_{serial}}{T_{parallel}} = \frac{n \cdot t_r(1) + n \cdot t_r(2) + n \cdot t_r(3) + \dots}{t_r(1) + f(1) \cdot t_r(2) + f(2) \cdot t_r(3) + \dots}$$

Because tree-groups usually split off at lower levels of the forest, where only a 
few examples are left, speedup is in most cases quite good. 
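As a worked illustration, the speedup expression can be evaluated for made-up numbers. The sketch assumes the reading in which the serial cost is n times the per-level time and the parallel cost weights level 1 by 1 and level l + 1 by f(l); the function name `speedup` and all input values are hypothetical.

```python
def speedup(n, level_times, groups):
    """Ratio of serial to parallel cost.

    level_times[k] is the time for level k + 1 of a single tree;
    groups[k] is the group count f(k + 1) that weights level k + 2.
    """
    serial = n * sum(level_times)
    weights = [1] + list(groups)
    parallel = sum(w * t for w, t in zip(weights, level_times))
    return serial / parallel


# 10 folds, three levels that get cheaper as fewer examples remain.
result = speedup(10, [4.0, 2.0, 1.0], [1, 2])
```

Because the expensive first level carries weight 1, late splits (large f only at cheap levels) leave most of the factor n intact.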

2.3 Query-Packs 

Figure 4 shows refinements for the query circle(C), in(C,D). The parallel
algorithm considers each of these refinements if it has to select the best query
for the left-left subtree of the Bongard forest from Fig. 3. All these refinements
have the first two literals in common and executing these queries separately on
the training set will cause redundant computations. By integrating the queries
in a query-pack [2] as shown in the right part of Fig. 4 this redundancy can be
removed.

A query-pack is a tree structure with literals or conjunctions of literals in the
nodes. Each path from the root to some node represents a conjunctive query.
Decision tree query-packs can be compared to brooms. The current query, its
length being proportional to the current tree depth, forms the stick of the broom.
It is shown in [2] that the speedup factor T_sequential/T_pack ranges from 1 to
min(c + 1, b), where b is the branching factor of the pack and c is the ratio of
the computational complexity in the shared part over the complexity in the
non-shared part. Because a broom has a long shared part and a high branching
factor, one can expect high speedups.

Fig. 4. A query-pack.



Note the similarity between a query-pack and a cross-validation forest. The 
structure and the basic idea are the same but the goals are different. A cross- 
validation forest represents the shared part of similar decision trees and a query- 
pack represents the shared part of similar queries. 
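The shared-prefix idea behind a query-pack can be sketched with a plain prefix tree. This is only a structural illustration: real query-pack execution relies on the Prolog engine and its backtracking, and the representation (nested dictionaries keyed by literal strings) and the names `build_pack` and `count_nodes` are assumptions made for the example.

```python
def build_pack(queries):
    """Merge conjunctive queries (tuples of literals) into a nested-dict prefix tree."""
    root = {}
    for query in queries:
        node = root
        for literal in query:
            node = node.setdefault(literal, {})
    return root


def count_nodes(pack):
    """Number of literal nodes stored in the pack."""
    return sum(1 + count_nodes(child) for child in pack.values())


# The four refinements of circle(C), in(C,D) used in the running example.
queries = [
    ("circle(C)", "in(C,D)", "triangle(D)"),
    ("circle(C)", "in(C,D)", "square(D)"),
    ("circle(C)", "in(C,D)", "circle(D)"),
    ("circle(C)", "in(C,D)", "in(D,E)"),
]
pack = build_pack(queries)
separate_literals = sum(len(q) for q in queries)
packed_literals = count_nodes(pack)
```

The stick of the broom (the two shared literals) is stored once, so the pack holds 6 literal nodes where the separate queries hold 12.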



3 Parallel Cross-Validation in ILP 

In this section we introduce the query-dependency problem and show how the 
query-packs from the previous section can be integrated in the parallel cross- 
validation algorithm. We discuss the latter in the context of first order rule and 
constraint induction systems. 

3.1 The Query-Dependency Problem 

In the parallel algorithm from [1] sharing of computations stops once a group of 
folds is split off from the main forest. In the propositional case, it is relatively 
easy to share computations between different fold groups. For ILP systems this 
is more difficult because of the query-dependency problem. In this section we 
discuss this query-dependency problem and show how the parallel algorithm can 
be improved to allow sharing of computations among different fold groups. 

Consider again the example forest from Fig. 3. Let G' be the group for folds
{0,1,3} and G'' the group for fold {2}. The best query for G', q' = (circle(C),
in(C,D)), differs from the best query for G'', q'' = (circle(C), triangle(D)).
Although q' and q'' are different, it is likely that the partitions of the examples
they induce have a large overlap. This is because q' and q'' both are best queries
on a similar data set.

Suppose that the partitions are similar. We now move to the next level of
the forest and evaluate all refinements of q' on its example set D' and all
refinements of q'' on its example set D''. This does not involve redundant
computations because the refinements of q' are all different from the refinements
of q''. This is because the refinements of q' contain the literal in(C,D) and the
refinements of q'' contain the literal triangle(D). This effect, to which we referred
in the introduction as the query-dependency problem, does not occur for a
propositional decision tree learner because this kind of system does not have
variables that link tests from different levels of the tree.




1 . triangle(D) 

2. square(D) 

3. circle(D) 

4. in(D,E) 

5. triangle(E) 

6. square(E) 

7. circle(E) 

8. in(C,D) 

9. in(D,E) 



Fig. 5. Overlapping refinements. 



Figure 5 shows all refinements for q' and q''. Each refinement is a conjunction
of 3 literals. The first literal is always circle(C), the second either in(C,D) or
triangle(D). For the last literal we have three different cases: it occurs only
in the refinement set of q' (literals 1-4 in Fig. 5), it occurs only in the refinement
set of q'' (8-9) or it occurs in both sets (5-7). The last case is of course the most
interesting. Is it possible to remove the second literal and gain efficiency by
evaluating refinements 5-7 only once instead of twice on the data?

Smartcall [3] is a query transformation that removes literals from a clause
which are known to succeed. It first partitions the query in equivalence classes.
Two literals are in the same class if they share variables, directly or indirectly.
Refinement 5 of q', for example, can be partitioned in two classes C_1 = {circle(C),
in(C,D)} and C_2 = {triangle(E)}. The literals in C_1 are known to succeed
because they were used to partition the data at a higher level of the tree. We
remove C_1 from the query and obtain triangle(E).
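The transformation can be sketched as follows. This is a reading of the description above rather than the Smartcall implementation from [3]: a literal is a (predicate, arguments) pair, every argument is assumed to be a variable, a class is dropped only when all of its literals are known to succeed, and the function names are hypothetical.

```python
def variable_classes(literals):
    """Group literals that share variables, directly or through other literals."""
    classes = []  # each entry: (set of variables, list of literals)
    for lit in literals:
        variables = set(lit[1])
        merged, untouched = [], []
        for cls_vars, cls_lits in classes:
            if cls_vars & variables:
                variables |= cls_vars
                merged.extend(cls_lits)
            else:
                untouched.append((cls_vars, cls_lits))
        untouched.append((variables, merged + [lit]))
        classes = untouched
    return [lits for _, lits in classes]


def smartcall(query, known_to_succeed):
    """Drop every variable class whose literals are all known to succeed."""
    known = set(known_to_succeed)
    kept = []
    for cls in variable_classes(query):
        if not all(lit in known for lit in cls):
            kept.extend(cls)
    return kept


circle_c = ("circle", ("C",))
in_cd = ("in", ("C", "D"))
triangle_d = ("triangle", ("D",))
triangle_e = ("triangle", ("E",))

# Refinement 5 in both groups reduces to the same single literal.
reduced_first = smartcall([circle_c, in_cd, triangle_e], [circle_c, in_cd])
reduced_second = smartcall([circle_c, triangle_d, triangle_e], [circle_c, triangle_d])
# Refinement 1 shares the variable D with the new literal, so nothing is removed.
unchanged = smartcall([circle_c, in_cd, triangle_d], [circle_c, in_cd])
```

A literal without arguments forms a class of its own, which matches the remark that propositional tests can always be removed.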

Smartcall removes the first two literals of refinements 5-7 for both q' and
q''. This is shown on the left side of Fig. 6. The rectangle represents the overlap
of the refinement sets in the vertical dimension and the overlap of the example
sets in the horizontal dimension. If the overlap is big in both dimensions then we
can expect a high speedup by evaluating overlapping queries only once on the
intersection of the example sets. Remember that the overlap of the example sets
is large if similar queries are selected in a higher node of the tree. The overlap
in the refinement sets is large if Smartcall can remove the crucial literals. Notice
that propositional tests can always be removed because they do not introduce
new variables.

Although we have considered only two groups G' and G'', everything said in
this section can be generalised to k > 2 groups. One problem is that the
intersection of the refinement sets and the intersection of the example sets are
then smaller. We solve this by adding groups using a greedy algorithm until the
number of examples in the intersection drops below a given threshold.



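[Figure: refinements after applying Smartcall (vertical dimension) against the example sets D' and D'' (horizontal dimension). The refinements listed are:
circle(C), in(C,D), triangle(D)
circle(C), in(C,D), square(D)
circle(C), in(C,D), circle(D)
circle(C), in(C,D), in(D,E)
triangle(E)
square(E)
circle(E)
circle(C), in(C,E)
triangle(D), in(D,E)]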






Fig. 6. Overlapping refinements and data sets. 



Another problem we have not discussed so far is the problem of anti-similar
queries. Suppose that we have a classification task with two classes pos and neg.
If query q_1 moves almost all positive examples to the left and query q_2 moves
almost all positive examples to the right then we have anti-similar queries. The
extreme case is that q_1 = ¬q_2. This problem can be solved by making sure that
the left subtree always covers the most positive examples. We swap the left and
right subtree if this is not the case. A similar approach can be used for regression
(put the set with the lowest mean on the left) or for classification problems with
more than two classes (impose an artificial order on the classes and put the set
with the smallest most frequent class on the left).

3.2 Combination with Query-Packs 

Figure 7 shows the parallel cross-validation algorithm adapted to use query-
packs. Line 2 of the adapted algorithm creates the pack Q and returns the
number of queries in the pack (i.e. the pack size) s. Lines 4-6 evaluate the pack
on the data. Because the query-pack integrates s different refinements, we have
to keep track of an (n + 1) × s statistic matrix. The rows of this matrix correspond
to the different trees in the forest and the columns correspond to the different
queries in the pack.



 1. (q*, Q*)_{0..n} = (none, -∞)
 2. refine_and_compile_pack(Q, s)
 3. PS^D_{1..n,1..s} = 0
 4. for each i ∈ 1...n
 5.     for each e ∈ D_i
 6.         execute_pack(Q, PS^D_i, e)
 7. PS^T_0 = Σ_i PS^D_i
 8. PS^T_i = PS^T_0 - PS^D_i   (i = 1...n)
 9. for each j ∈ 1...s
10.     q = get_from_pack(Q, j)
11.     for each i ∈ 0...n
12.         NS^T_{i,j} = TS^T_i - PS^T_{i,j}
13.         Q = compute_quality(TS^T_i, PS^T_{i,j}, NS^T_{i,j})
14.         if Q > Q*_i then (q*, Q*)_i = (q, Q)
15. for each different q* ∈ {q*_i}
16.     partition D_i according to q*



Fig. 7. Packs version of the parallel algorithm. 



The leaves of the pack contain update_statistics functions (see Fig. 2).
These functions update the positive component PS^D_{i,j} of the statistics S^D_{i,j} for
each query q_j in the pack. Updating the negative component of S^D_{i,j} is not
possible with the pack representation discussed in Section 2.3. However, NS^T_{i,j} can
be calculated indirectly, by introducing total statistics TS^D_i and TS^T_i. Total
statistics are equal for all refinements q_j and can be computed from the data
before the parallel algorithm is started. The negative component NS^T_{i,j} can be
derived using the equality TS^T_i = PS^T_{i,j} + NS^T_{i,j} (Line 12). The rest of the algorithm
is similar to the version without packs (see Fig. 2).
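The indirect computation of the negative component amounts to one subtraction per matrix cell. The sketch below assumes that statistics are plain counts, with one total per tree and one row of positive counts per tree; the name `negative_statistics` and the numbers are hypothetical.

```python
def negative_statistics(total, positive):
    """Return NS[i][j] = TS[i] - PS[i][j] for tree i and pack query j."""
    return [[total[i] - ps for ps in row] for i, row in enumerate(positive)]


# Two trees, three queries in the pack.
total = [10, 7]                       # examples in each training set
positive = [[4, 10, 0], [3, 7, 1]]    # examples on which each query succeeds
negative = negative_statistics(total, positive)
```

The totals do not depend on the query, so they are computed once before the pack is executed.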



3.3 Rule Induction 

Although we focused on decision tree induction, almost everything said so far 
also applies to top down rule induction (e.g. FOIL m, Progol [HI) and to top 
down constraint induction (e.g. ICLH). A top down rule induction system tries 
to cover all positive examples by learning a disjunction of conjunctive rules. Each 
time a new rule is learned, the system removes the covered positive examples
from the data set and tries to learn a next rule until all positive examples are
covered or no more good rules can be found. To learn one rule, the system starts
from the most general rule true and keeps adding the best literal according to
some quality measure as long as the rule’s quality improves.

We can use the ideas from the parallel decision tree cross-validation algo- 
rithm to build a parallel rule cross-validation algorithm. This parallel rule cross- 
validation algorithm builds a tree that represents the shared part between rules 
for different folds. If one or more folds select a different literal then the rule for 
this group of folds is split off from the main tree and forms a new branch. This 
can be compared to the cross-validation forest from Section 2.2. As long as a
group of folds remains together, computations can be shared by evaluating the 
rule for this group on the disjoint sets Di and not on the overlapping training 
sets Ti. 

If it is possible to use the Smartcall query transformation (i.e. if the positive 
examples that are not covered by the current rule are removed each time a new 
literal is added) then it is also possible to share computations between different 
branches of the rule tree in a similar way as discussed in Section 3.1.

Before a rule induction system starts to learn a new rule, it removes all 
positive examples covered by the previous rule. As long as the same rules are 
selected for different folds of the cross-validation (i.e. the tree has no branches) 
the same examples are removed from the data set D and D remains equal for all 
folds. If fold i selects a different rule then different examples will be removed from 
D and the data set for fold i will differ from the data sets of the other folds. 
If we partition D in {D^s, D^1, ..., D^n}, where D^s contains the shared examples
and D^i contains the examples for fold i not in D^s, then it is possible to share
computations over D^s.

Progol uses one of the positive examples e to constrain its hypothesis space
H to a space H_e which only contains hypotheses more general than e. After
that, it performs an exhaustive search in H_e looking for the rule that maximises
a quality measure called compaction. When running an n-fold cross-validation,
each fold will select a different example but some of the hypothesis spaces H_e
will be equal. This means it is possible to remove redundancies among folds that
share the same H_e.

4 Experimental Results 

For our experiments we implemented the different cross-validation optimisations
discussed in this text as a module of Tilde, the first order decision tree learner
from the ACE data mining tool [2].

We compare execution times of a 10-fold run for serial (no optimisations),
serial + query-packs, parallel, parallel + intersection (share computations among
different groups of trees) and finally parallel cross-validation + query-packs. The
data sets used are:

- The simple (SB) and complex (CB) Bongard data set (this set was also
used as running example in this text). SB contains 1453 examples with a
simple underlying theory, CB contains 1521 examples with a more complex
theory.

- A subset (ASM) of 999 examples sampled from the “Adaptive Systems
Management” data set, kindly provided to us by Perot Systems Nederland.

- The Mutagenesis (Muta) data set [12], an ILP benchmark (230 examples).

Footnote: ACE is available for academic purposes upon request,
http://www.cs.kuleuven.ac.be/~dtai/ACE/




Fig. 8. Different cross-validation optimisations (time relative to serial 10-fold). 



As can be seen in Figure 8, Table 1 and Table 2, the results are not uniform
for all data sets. This is because we deliberately selected different types
of data sets. For SB all parallel algorithms perform more or less the same. The 
intersection algorithm does not perform better because no groups are split off 
from the forest. The packs version does not perform better because the queries 
are too short to have many literals in common. CB is the least surprising data 
set: some groups are split off from the forest and the queries become longer. The 
ASM data set contains many propositional numeric attributes. The combination 
of the parallel algorithm with intersection performs well on this data set because 
it has a lot of propositional attributes that can be removed by Smartcall. One 
problem is that intersecting refinement sets (using a hash tree) and example sets 
(sorted lists) is still rather slow in our implementation. Query-packs perform
very badly on ASM because the pack has a high branching factor (it contains a
huge number of tests comparing numeric attributes to each of their discretised 
values). The time necessary for compiling a pack depends on the pack size. Near 
the leaves this compilation time dominates the execution time which is linear 
in the number of examples. The Mutagenesis data set is not suited for parallel 
cross-validation because some queries, generated near the leaves of the forest, 
dominate the total execution time (this happens when the algorithm looks for 
circular substructures in the molecules). Query-packs on the other hand perform
very well for Mutagenesis (10 times faster). This is because query-pack execution
shares computations among the different refinements of these few complex
queries.

Footnote: The timings are faster compared to [4] because we have ported the
statistics code from Prolog to C++.

Table 1. Timings (seconds) comparing parallel to serial cross-validation (10-folds),
once with packs disabled and a second time with packs enabled. The last column is the
ratio packs off / packs on; the Speedup rows are the ratio serial / parallel.

Data set  Method    Packs off  Packs on  Ratio
SB        Serial    16         11        1.5x
          Parallel  3.6        3.5       1.0x
          Speedup   4.4x       3.1x
CB        Serial    24         15        1.6x
          Parallel  8.2        5.4       1.5x
          Speedup   2.9x       2.8x
ASM       Serial    4100       4300      0.95x
          Parallel  2100       2600      0.81x
          Speedup   2.0x       1.7x
Muta      Serial    5000       500       10x
          Parallel  4600       450       10x
          Speedup   1.09x      1.11x


Table 2. The effect of sharing computations between different fold groups (10-folds,
times in seconds).

           SB     CB     ASM    Muta
Parallel   3.6    8.2    2100   4600
Intersect  3.5    7.1    1710   4400
Speedup    1.03x  1.15x  1.22x  1.05x



5 Conclusions 

We discussed two optimisations specific to ILP for the parallel decision tree cross-
validation algorithm proposed in [1]. The first one was intersecting refinement
and example sets to be able to share computations between different fold groups 
of the cross-validation forest. The second one was integrating query-packs in the 
parallel algorithm. 

One possible improvement is to implement the combination of the two opti- 
misations discussed in this text: parallel cross-validation with query-packs and 
intersection. In order to implement this, one would need an equivalent to Smart- 
call for query-packs. This is future work. 

It became clear from the experiments that different optimisations work well 
for different data sets. The more optimisations that are integrated in a learning 
system, the more difficult it is for an end-user to know which optimisations are 
suited for his particular data set. Maybe it is possible to use meta-learning to 
decide which optimisations have to be used for a given set. 

We also discussed how the ideas from the parallel decision tree cross-validation
algorithm can be used to devise a parallel rule or constraint induction
cross-validation system.

Abstract. Semi-structured documents are now commonly used for exchanging
information. The aim of this research is to apply deductive and
inductive reasoning to semi-structured documents. From our observation
that first-order terms are inadequate for modelling semi-structured
documents, we model them with hedges. After defining semi-structured
documents and hedges so that they can contain logical variables, we
introduce hedge logic programs, in which every argument of an atom is
a hedge. We give a method for transforming hedge logic programs into
original logic programs. We also give an algorithm for computing minimal
common anti-unifications of hedges, aiming at inductive reasoning
of hedge logic programs from sets of semi-structured data.



1 Introduction 

Semi-structured documents are now commonly used for exchanging information.
HTML and XML are the most famous languages in which flat text documents are
marked up into semi-structured ones. In exchanging information we first define
types of data and then represent a method of transforming one type of data into
another. It is well known that logic programs are useful in both of these
activities. Moreover, inductive logic programming techniques could contribute to
automating them. In this paper we investigate how to treat semi-structured
documents in logic programming. We also give a fundamental result for inductive
reasoning of logic programs from sets of semi-structured documents.

Some logic programming systems, e.g. Pillow [5], already support the
treatment of semi-structured documents. They model semi-structured documents
with first-order terms, but as explained in this paper, such modelling
causes problems. Instead of using first-order terms, we propose to adopt hedges [B].



C. Rouveirol and M. Sebag (Eds.): ILP 2001, LNAI 2157, pp. 240-247, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

<TABLE>
<TR><TD>John Doe</TD><TD>34</TD><TD>John.Doe@foo.com</TD></TR>
<TR><TD>Bob Smith</TD><TD>29</TD><TD>Bob.Smith@foo.com</TD></TR>
<TR><TD>Alice Miller</TD><TD>26</TD>
<TD>Alice.Miller@foo.com</TD></TR>
</TABLE>



Fig. 1. An example of a semi-structured document 



Hedges are sometimes called forests in or ordered forests [1] . Some previous re- 
search IWol has shown that hedges are adequate for modelling semi-structured 
documents. 

In the following section we explain that modelling semi-structured documents 
with first-order terms is inadequate in logic programming. Then we formally 
define semi-structured documents and hedges so that they can contain logical 
variables. In Section 3 we introduce logic programs in which hedges are used as
arguments of atoms. We give a method for transforming such hedge logic programs
into original logic programs while keeping their procedural semantics based
on SLD-resolution. In Section 4 we give an lca (least common anti-unification)
algorithm for two hedges in some class, aiming at inductive reasoning of hedge
logic programs from sets of semi-structured data. In the last section hedge logic
programming is compared with other types of extension of logic programming. 

Because of the limit of space, precise discussion including the proofs of a 
theorem and lemmata is omitted in this paper. It will be given in a forthcoming 
full paper. 



2 Semi-structured Documents and Hedges 

Let us consider a semi-structured document d_1 displayed in Fig. 1. We put some
spaces and newlines for readability. By using the tag names TABLE, TR and TD
as function symbols, the document d_1 could be represented as a first-order term

t_1 = TABLE(TR(TD(John Doe), TD(34), TD(John.Doe@foo.com)),
            TR(TD(Bob Smith), TD(29), TD(Bob.Smith@foo.com)),
            TR(TD(Alice Miller), TD(26), TD(Alice.Miller@foo.com)))

As is well-known, the document d_1 is for representing a table in HTML which
can have arbitrary numbers of rows marked-up with <TR> and </TR>. This means
that the arity of the function symbol TABLE in t_1 may vary.

A problem is caused by the variation of arities of function symbols when
we define the type of t_1 with a logic program. Since there is no bound on the
arity of the function symbol TABLE, we have to prepare infinitely
many clauses:

is_table(TABLE(x_1)) ← is_column(x_1)
is_table(TABLE(x_1, x_2)) ← is_column(x_1), is_column(x_2)
is_table(TABLE(x_1, x_2, x_3)) ← is_column(x_1), is_column(x_2), is_column(x_3)
...

We cannot replace all of the clauses with a clause C_1 = is_table(TABLE(x)) ←
is_column(x), though only one DTD rule <!ELEMENT TABLE (TR)+> is sufficient
to represent the type which d_1 belongs to. This shows that first-order terms
are not adequate for modelling semi-structured documents. If we use hedges as
arguments of atoms in logic programs, the clause C_1 is enough in the definition
of the type of table documents.

Now we formally define semi-structured documents and hedges with context-
free grammars. We represent a grammar with a tuple G = (V, T, R, S), where V
and T are finite sets of non-terminal symbols and terminal symbols respectively,
R is a set of production rules, and S is the start symbol. The set R is allowed
to have infinitely many rules.

Let N and Σ be mutually disjoint sets. Each element in N is called a name.
For every name n we prepare a pair of new symbols b_n and e_n, which are
respectively called a start-tag and an end-tag. The set of start-tags (end-tags) is
denoted by B_N (E_N, resp.). We also prepare a set X of logical variables (or
variables) so that semi-structured documents and hedges can be used in logic
programs.

Definition 1. A well-formed pattern over (Σ, N, X) is a word in the language
L(G_W) generated by a context-free grammar G_W = (V_W, T_W, R_W, S), where
V_W = {S} ∪ {T_n | n ∈ N}, T_W = Σ ∪ X ∪ B_N ∪ E_N, and

$$R_W = \{S \to \varepsilon\} \cup \{S \to SS\} \cup \{S \to T_n \mid n \in N\} \cup \{T_n \to b_n S e_n \mid n \in N\} \cup \{S \to c \mid c \in \Sigma \cup X\}.$$

A well-formed document is a well-formed pattern without any logical variables.
The set of well-formed patterns over (Σ, N, X) is denoted by W(Σ, N, X).

In XML the set N is a subset of defined according to the specification 
(http://www.w3.org/XML/), and b_n (e_n) is a string <n> (</n>, resp.).

Definition 2. A hedge h over (Σ, N, X) is defined inductively as follows:

1. The empty string ε is a hedge.

2. A terminal symbol c ∈ Σ is a hedge.

3. A logical variable x ∈ X is a hedge.

4. If n ∈ N is a name and h is a hedge, then n(h) is a hedge.

5. If h_1 and h_2 are hedges, then a concatenation h_1 h_2 is a hedge.

A hedge without any logical variables is called a ground hedge. The set of hedges
over (Σ, N, X) is denoted by H(Σ, N, X). For a hedge of the form n(h_1 ⋯ h_m),
h_1, ..., h_m are called siblings.

We often write (h_1 ⋯ h_m) for a hedge h_1 ⋯ h_m (m > 0) for readability, though
this pair of parentheses without any name is not a part of the syntax.

We define a grammar G_H = (V_W, T_W, R_H, S), where V_W and T_W are the same
as those used in G_W, and R_H is obtained by replacing each rule of the form
T_n → b_n S e_n with T_n → n(S). It is easily checked that every word accepted by G_H
is a hedge in H(Σ, N, X), and any hedge in H(Σ, N, X) is accepted by G_H. Moreover,
directly from the definition of G_W and G_H, every well-formed pattern w can be
translated into exactly one hedge hdg(w), and conversely, every hedge h can be
translated into exactly one well-formed pattern w such that h = hdg(w).

Example 1. The hedge hdg(d_1) is obtained by removing all commas from t_1.
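The translation from a well-formed document to a hedge can be sketched with a small stack-based parser. The sketch makes several assumptions that are not in the text: a hedge is a Python list whose items are text strings or (name, hedge) pairs, each run of text between tags is treated as a single terminal, logical variables are not handled, and the function name `hdg` merely mirrors the notation above.

```python
import re


def hdg(document):
    """Translate a well-formed tagged string into a nested hedge.

    A hedge is a list whose items are text strings or (name, hedge) pairs.
    """
    tokens = re.findall(r"</?[A-Za-z][A-Za-z0-9]*>|[^<]+", document)
    stack = [("", [])]  # (name of the open tag, hedge built so far)
    for token in tokens:
        if token.startswith("</"):
            name, children = stack.pop()
            if name != token[2:-1] or not stack:
                raise ValueError("mismatched end-tag " + token)
            stack[-1][1].append((name, children))
        elif token.startswith("<"):
            stack.append((token[1:-1], []))
        elif token.strip():
            stack[-1][1].append(token.strip())
    if len(stack) != 1:
        raise ValueError("unclosed start-tag " + stack[-1][0])
    return stack[0][1]


row = hdg("<TABLE><TR><TD>John Doe</TD><TD>34</TD></TR></TABLE>")
```

Because children are kept in a list of arbitrary length, a TABLE node with one row and a TABLE node with many rows have the same shape, which is the point of using hedges instead of fixed-arity terms.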



3 Deductive Reasoning on Hedges 

Let Π be a set of predicate symbols, with each of which a natural number arity
is associated. We define an atom as an expression of the form p(h_1, h_2, ..., h_n),
where p ∈ Π, n is the arity of p, and h_i ∈ H(Σ, N, X) for every i = 1, 2, ..., n.
Definite clauses and goal clauses are defined with such atoms in the same manner 
as in the original logic programming. A hedge logic program is a finite set of 
definite clauses. 

We use SLD-resolution for the execution of hedge logic programs. We have 
to take it into consideration that there is no mgu (most general unifier) of two 
hedges in general. This problem is solved by assuming some restriction to the 
occurrences of variables in definite clauses and goal clauses. An example of such 
restriction was given in . 

Example 2. Let P_1 be a hedge logic program

{ trans(x, y) ← t(x, y)
  t(TABLE(x), TABLE(y)) ← s(x, y)
  s((TR(x TD(z)) y), (TR(TD(z)) w)) ← s(y, w)
  s(ε, ε) ← }

and G_1 a goal clause ← trans(hdg(d_1), x), where d_1 is the well-formed
document in Fig. 1. By an SLD-refutation of P_1 ∪ {G_1} we obtain an answer which
substitutes the variable x in G_1 with a hedge

TABLE(TR(TD(John.Doe@foo.com))
      TR(TD(Bob.Smith@foo.com))
      TR(TD(Alice.Miller@foo.com)))

Each hedge is regarded as a congruence class of first-order terms, under an 
equality theory representing associative law for catenation and idempotence of 
the empty hedge. Based on this fact we give a method with which every hedge 
logic program can be transformed into an original logic program using lists with 
function symbols [ ] and [ | ], and an auxiliary relation append.

Definition 3. For a hedge h ∈ H(Σ, N, X), we inductively define a first-order
term α(h) and a sequence of first-order atoms β(h) as follows:

1. α(ε) = [] and β(ε) = φ.

2. For a constant c ∈ Σ, α(c) = [c] and β(c) = φ.

3. For a variable x ∈ X, α(x) = x and β(x) = φ.

4. If h = n(h_1) for some name n ∈ N, then α(h) = α(n(h_1)) = n(α(h_1)) and
β(h) = β(h_1).

5. If h = h_1 h_2, then α(h) = y and β(h) = append(α(h_1), α(h_2), y), β(h_1), β(h_2),
where y is a fresh variable.

Each of α and β can be easily extended to a function on atomic formulas. For a
definite clause C = A_0 ← A_1, ..., A_k of a hedge logic program P, we define

π(C) = α(A_0) ← α(A_1), ..., α(A_k), β(A_0), β(A_1), ..., β(A_k)

and we define π(P) = {π(C) | C ∈ P}.

Lemma 1. Let P be a hedge logic program, and G be a goal clause. Then there is
an SLD-resolution for P ∪ G iff there is an SLD-resolution for π(P) ∪ P_append ∪ π(G),
where P_append is a logic program defining the predicate append.

Example 3. A definite clause p(x TD(y) z) ← q(x), r(y), q(z) in a hedge logic
program is translated to a first-order definite clause

p(w) ← q(x), r(y), q(z), append(x, [td(y)], u), append(u, z, w)
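The translation of a hedge argument into a list term plus append atoms can be sketched as below. The sketch follows Example 3, where a named node becomes a singleton list such as [td(y)]; this wrapping, the left-to-right chaining of append atoms, the tagged-tuple representation of hedges and the name `translate` are all assumptions made for illustration.

```python
import itertools


def translate(hedge, fresh):
    """Return (term, atoms) for a hedge given as a list of tagged items.

    Items are ("var", x), ("const", c) or ("node", name, child_hedge);
    `fresh` is an iterator that yields unused variable names.
    """
    if not hedge:
        return "[]", []
    terms, atoms = [], []
    for item in hedge:
        kind = item[0]
        if kind == "var":
            terms.append(item[1])
        elif kind == "const":
            terms.append("[" + item[1] + "]")
        else:
            child_term, child_atoms = translate(item[2], fresh)
            terms.append("[" + item[1] + "(" + child_term + ")]")
            atoms.extend(child_atoms)
    result = terms[0]
    for term in terms[1:]:
        y = next(fresh)  # fresh variable naming the concatenation so far
        atoms.append("append(" + result + ", " + term + ", " + y + ")")
        result = y
    return result, atoms


fresh = ("u%d" % i for i in itertools.count(1))
argument = [("var", "x"), ("node", "td", [("var", "y")]), ("var", "z")]
term, atoms = translate(argument, fresh)
```

For the argument x TD(y) z this yields the same two append atoms as Example 3, with u1 and u2 playing the roles of u and w.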

4 Inductive Reasoning on Hedges 

In this section, we give an anti-unification algorithm for hedges. For a hedge p,
we define the size of p, denoted by size(p), as the total number of occurrences in p
of symbols in Σ ∪ X ∪ N. Note that we allow the empty substitutions.

Definition 4. Let p, q be hedges. If there exists some substitution θ such that
pθ = q then p is more general than q, written p ≥ q. If p ≥ q but q ≱ p then
p is properly more general than q, written p > q. If p ≥ q (p > q), then p is
a generalization of q (proper generalization of q). If p ≥ q and q ≥ p then p
is equivalent to q, written p ≡ q.

Definition 5. For a hedge h, we define the language of h as the set L(h) = {g |
hθ = g, g is a ground hedge, θ is a substitution} of ground hedges obtained by
substituting ground hedges for the variables in h.

Clearly, p ≡ q if and only if p and q are obtained from each other by renaming
variables. Furthermore, if p > q then size(p) < size(q) holds.

There is no unique least common anti-unification (lca) or least general
generalization (lgg) for hedges. Thus, we define below a weaker notion, the mca.

Definition 6. Let $q_1, q_2$ be hedges. A common generalization of $q_1$ and $q_2$ is a
hedge $p$ such that $p \geq q_i$ for every $i = 1, 2$. A minimal common anti-unification
(mca) of $q_1$ and $q_2$ is a common generalization $p$ of $q_1$ and $q_2$ such that $p \not> p'$
for any common generalization $p'$ of $q_1$ and $q_2$.

There are exponentially many mca's of a given pair of hedges, while there is
a unique lca for first-order terms. It is therefore difficult to compute all of the mca's
of a given pair of hedges. Thus, we concentrate on the problem of finding
one of the minimal anti-unifications for a simple subclass, called simple hedges.

Definition 7. A hedge v is simple if v has mutually distinct logical variables 
and any set of siblings contains at most one logical variable. 

Example 4. Let $\Sigma = \{a, b, c\}$, $X = \{x, y, z, \ldots\}$, and $N = \{f, g\}$. Then, $h_1 =
(a\ x\ f(y\ g(b)))$, $h_2 = (g(f(x)\ f(y)\ f(z)))$ and $h_3 = (x\ f(y\ f(a\ z)))$ are examples
of simple hedges. On the other hand, the hedges $h_4 = (a\ x\ f(b)\ y\ a)$, $h_5 =
(x\ f(x))$ and $h_6 = (g(f(x)\ f(x)))$ are not simple because $h_4$ contains $x, y$ among
siblings, and $h_5$ and $h_6$ contain repeated occurrences of $x$.
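The simplicity condition of Definition 7 can be checked mechanically. The following is a minimal sketch, not part of the original paper: the encoding of hedges as nested Python lists, the `Var` class and the function names are all assumptions made for illustration. Constants are strings, variables are `Var` objects, and a node n(h) is a tuple `(name, child_hedge)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A logical variable (hypothetical encoding, not from the paper)."""
    name: str


def variables(hedge):
    """Return every variable occurrence in the hedge, left to right."""
    found = []
    for item in hedge:
        if isinstance(item, Var):
            found.append(item.name)
        elif isinstance(item, tuple):  # node (name, child_hedge)
            found.extend(variables(item[1]))
    return found


def one_var_per_sibling_set(hedge):
    """Check that every set of siblings contains at most one variable."""
    if sum(isinstance(item, Var) for item in hedge) > 1:
        return False
    return all(one_var_per_sibling_set(item[1])
               for item in hedge if isinstance(item, tuple))


def is_simple(hedge):
    """Definition 7: distinct variables, at most one variable per sibling set."""
    names = variables(hedge)
    return len(names) == len(set(names)) and one_var_per_sibling_set(hedge)


x, y, z = Var("x"), Var("y"), Var("z")
h1 = ["a", x, ("f", [y, ("g", ["b"])])]
h2 = [("g", [("f", [x]), ("f", [y]), ("f", [z])])]
h3 = [x, ("f", [y, ("f", ["a", z])])]
h4 = ["a", x, ("f", ["b"]), y, "a"]
h5 = [x, ("f", [x])]
h6 = [("g", [("f", [x]), ("f", [x])])]
```

Under this encoding, `is_simple` accepts the first three hedges of Example 4 and rejects the last three.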

A key component of our anti-unification algorithm for hedges is a search operator called
the refinement operator, defined as follows, which is used to search hedges
from general to specific.

Definition 8. A substitution $\theta$ for a hedge $p$ is called simple if $\theta$ has one of the
following forms: (i) $x := x\ n(y)$, (ii) $x := n(y)\ x$, (iii) $x := y\ c$, (iv) $x := c\ y$, and
(v) $x := \varepsilon$, where $n \in N$ is a name, $c \in \Sigma$ is a letter, and $x \in var(p)$ and $y \notin var(p)$
are variables appearing and not appearing in $p$, respectively.

Definition 9. For simple hedges $p, q$, if there exists a simple substitution $\theta$ such
that $p\theta = q$ then we write $p \Rightarrow q$. We denote by $\Rightarrow^{+}$ and $\Rightarrow^{*}$ the transitive closure
and the reflexive transitive closure of $\Rightarrow$, respectively. If there exists a sequence
$p_0 = p \Rightarrow p_1 \Rightarrow \cdots \Rightarrow p_n = q$ ($n \geq 0$) of simple hedges $p_i$ ($1 \leq i \leq n$),
then we write $p \Rightarrow^{*} q$. This sequence is called a derivation sequence for $p \Rightarrow^{*} q$.



Definition 10. Let $p, q_1, q_2$ be simple hedges. Then, we define the set $\rho(p, q_1, q_2)$
of immediate refinements of $p$ by

$$\rho(p, q_1, q_2) = \{p' \mid p' \text{ is a simple hedge},\ p \Rightarrow p',\ p' \geq q_1,\ p' \geq q_2\}$$

Lemma 2. For any simple hedges $p, q_1, q_2$ such that $p \geq q_i$ for every $i = 1, 2$,
the set $\rho(p, q_1, q_2)$ is of cardinality 5n, of polynomial size, and O(n^) time
computable, where $n = size(p)$. Furthermore, it is decidable in O(n^) time to check
whether $\rho(p, q_1, q_2)$ is empty.

In Fig. 2 we present an efficient algorithm for computing one of the simple
mca's of a given pair of hedges.

Theorem 1. For any pair of simple hedges $q_1, q_2$, the algorithm AU of Fig. 2
computes one of the minimal anti-unifications of $q_1$ and $q_2$ in time O(n^), where
$n = size(q_1) + size(q_2)$ is the total input size.
Algorithm AU(q1, q2)

Input: Simple hedges q1, q2.

Output: An mca of q1, q2 within simple hedges.

    p := x;
    while ρ(p, q1, q2) ≠ ∅ do begin    /* Apply refinement operator */
        Select any p' ∈ ρ(p, q1, q2);
        p := p';
    end;
    output p;

Fig. 2. A polynomial time algorithm for computing a minimal common anti-unification
<TABLE>
<TR><TD>Forecast for 04.22</TD><TD>Tokyo</TD><TD>Strasbourg</TD></TR>
<TR><TD>Today</TD><TD>Sunny</TD><TD>Rain</TD></TR>
<TR><TD>Tomorrow</TD><TD>Partly Cloudy</TD><TD>Showers</TD></TR>
</TABLE>

<TABLE>
<TR><TD>Forecast for 04.23</TD><TD>Tokyo</TD><TD>Strasbourg</TD></TR>
<TR><TD>Today</TD><TD>Cloudy</TD><TD>Showers</TD></TR>
</TABLE>

Fig. 3. Two well-formed documents for anti-unification



Example 5. Let us consider the well-formed documents in Fig. 3, which describe
weather forecasts. After transforming them into hedges, AU outputs a
hedge $h$ which is a minimal anti-unification. The well-formed pattern $w$ in Fig. 4
satisfies $h = hdg(w)$.

This example shows that anti-unification extracts a common structure of given
semi-structured documents as a hedge pattern, which is general enough to be
used in hedge logic programs as the arguments of some predicates.



5 Concluding Remarks 

Hedges are used in various technologies for semi-structured documents, especially
XML documents. A type definition language has been developed based on automata
(regular expressions) on hedges [ffij . In LMX [H] transformations of types
can be represented as hedges. We are now confirming that hedge logic programs
cover the roles of these languages. For deductive databases of XML documents,
forestlog m has been proposed based on hedges. The usage of hedge logic programs in
forestlog is quite different from ours because it adopts the $T_P$ operator.

Some researchers have proposed to model semi-structured documents with
data structures other than hedges. Thomas m developed a Prolog program package
for representing a semi-structured document as a feature structure. However,
feature structures cannot keep the order of contents in a semi-structured document.
Grieser et al. [7] proposed to apply EFSs [2] to semi-structured documents.
Since EFSs regard a semi-structured document as a string of symbols,
they need some technique for distinguishing names and terminal symbols. One
such technique is introducing negation in EFSs as in [7]. Another technique
is employing type definitions in EFSs. The full-paper version of this extended
abstract will show the relation between typed EFSs and hedge logic programs.

<TABLE>
<TR><TD>Forecast for 04.2x</TD><TD>Tokyo</TD><TD>Strasbourg</TD></TR>
<TR><TD>Today</TD><TD>y</TD><TD>z</TD></TR>
w
</TABLE>

Fig. 4. An output of the anti-unification algorithm
Abstract. The Bayesian framework of learning from positive noise-free
examples derived by Muggleton [12] is extended to learning functional
hypotheses from positive examples containing normally distributed noise
in the outputs. The method subsumes a type of distance-based learning
as a special case. We also present an effective method of outlier
identification which may significantly improve the predictive accuracy of
the final multi-clause hypothesis if it is constructed by a clause-by-clause
covering algorithm, as e.g. in Progol or Aleph. Our method is implemented
in Aleph and tested in two experiments, one of which concerns numeric
functions while the other treats non-numeric discrete data where the normal
distribution is taken as an approximation of the discrete distribution
of noise.



1 Introduction 

Most noise-handling techniques in machine learning are suited for the type
of errors caused by wrong classification of training examples into classes, e.g.
true or false. In a powerful family of ML methods such as ILP, which uses a
Turing-equivalent representation to produce hypotheses and can therefore
hypothesise about complicated input-output relations (e.g. functions), the role of
noise in attributes (arguments) has been recognised m but rarely
handled. Moreover, we are not aware of a system which would directly exploit
the knowledge of a particular noise distribution in arguments, despite the fact
that Bayesian and distance-based techniques, which have recently received a
lot of attention in ILP [6,8,3,16,15], can very well serve this purpose.

We want to test the hypothesis that by exploiting the knowledge of a particular
noise distribution in the data (though it may hold only approximately) we
may outperform standard noise-handling techniques. In the next section we shall
see how to optimally (in the Bayes sense) learn functions with unknown domains
and normally distributed noise in the output arguments. The outstanding role
of the normal noise distribution has been extensively justified in many sources
(see e.g. PJ), namely on the basis of the central limit theorem. We implemented
the method in the ILP system Aleph. Section 3 describes an effective
outlier-identification technique applicable in the clause-by-clause theory construction
performed by this system, modified so as to follow the guideline developed in
Section 2.

In the experimental part (Section 4), we first test our method on artificial
data. In particular, we learn numeric functions representable by a one-clause
Prolog program. This experiment will comply with the conditions of U-learning
[12] and the noise will be exactly normal. We shall then also try to slightly relax
the conditions of U-learning. The second experiment will be based on English
verb past tense data. These data have a functional, discrete and non-numeric
character. The output argument will be damaged by altering a certain number of
characters in the word and the continuous normal distribution of noise will only
be approximated. This kind of error simulates the one encountered in literal
data digitisation by e.g. OCR systems or human transcription. The predictive
accuracy of the resulting multi-clause theory will be significantly improved by
the outlier-identification technique described in Section 3. Section 5 concludes.

2 Bayesian Framework 

A standard approach to learning functions from positive data in ILP makes use of
the closed-world assumption (CWA). Using CWA, we substitute the negative examples
necessary in the normal ILP setting e.g. by an integrity constraint which
falsifies all hypotheses which yield the output $out_h$ for an input $in$, such that
there exists a positive example $e(in, out_e)$ and $out_e \neq out_h$. But CWA clearly
cannot be used if the output part of examples contains noise.

Another common drawback of functional learners is that they get no information
from the distribution of values in the input parts of the presented positive
examples. To get a rough idea how such information could be used, imagine that
we are learning scalar functions on the integer (sampling) interval (−10; 10).
Assume that the current hypothesis space is {equal(in, out), sqrt(in, out)} and
we get two positive examples e(0, 0) and e(1, 1). Then both hypotheses are
consistent with the examples but sqrt/2 has higher posterior probability (in the
Bayes sense) since it is less general (defined only for non-negative inputs).

Both of these problems will be treated in the following framework embedded
in Muggleton's U-learning scheme of learning from positive data [12]. For
ease of insight we shall formalize it for numeric data, to later easily generalize
it to non-numeric data in the experimental part of the text.

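Let $I$ be a finite set. If $f$ and $g$ are (real) functions on a superset of $I$ then
the Euclidean distance between $f$ and $g$ on $I$ is

$$\mathcal{E}(f(I), g(I)) = \sum_{i \in I} (f(i) - g(i))^2 \tag{1}$$

The normal distribution $N_{\mu,\sigma}(x)$ with mean $\mu$ and standard deviation $\sigma$ is given
as

$$N_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \tag{2}$$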

Let bold characters denote vectors, their elements being addressed by the lower
index. The instance space $X$ will be the Cartesian product of the sets of possible
inputs $I$ and outputs $O$. An instance (example)¹ $e \in X$ is then given by the
input part $in(e) \in I$ and output part $out(e) \in O$. These parts are in general
vectors of $|in|$ and $|out|$ elements, respectively. A (functional) hypothesis $H$ on
the instance set $X = I \times O$ is a tuple $(H_d \subseteq I,\ h : H_d \to O)$. $H_d$ is the domain of
$H$ and $H_c = \{e \in E \mid in(e) \in H_d\}$ is the coverage of $H$. $H$ is said to be consistent
with $e \in X$ if $in(e) \in H_d$; $H$ is consistent (with $E$) if it is consistent with all
$e \in E$. The mapping (such as $h$) corresponding to a hypothesis (such as $H$) will
always be denoted by lowering the case, and by $h_j(.)$ we shall denote the $j$-th
element of $h(.)$.

¹ We reserve plain characters for vector examples and mappings to improve readability.

Given a probability distribution $D_I$ on the input space and assuming mutual
independence of outputs, we can express the distribution of the conditional
probability on the instance space under the condition of validity of a hypothesis
$H$ as

$$D_{X|H}(e) = D_{X|H}(in(e), out(e)) = D_{I|H}(in(e))\, D_{O|H,in(e)}(out(e)) = D_{I|H}(in(e)) \prod_{j=1}^{|out|} D_{O|H,in(e)}(out_j(e)) \tag{3}$$

$D_{X|H}(e)$ is zero if $e$ is not consistent with $H$ since then $D_{I|H}(in(e)) = 0$. Otherwise,
$D_{I|H}(in(e))$ can be expressed as

$$D_{I|H}(in(e)) = \frac{D_I(in(e))}{D_I(H_d)} \tag{4}$$

and the conditional probability on the outputs will express our assumption of
normally distributed error with standard deviation $\sigma_j$ in the $j$-th output argument

$$D_{O|H,in(e)}(out_j(e)) = N_{h_j(in(e)),\sigma_j}(out_j(e)) \tag{5}$$



Given a prior probability distribution on hypotheses $D_H$, a target hypothesis
$H^*$, and a set of examples $E = e_1, e_2, \ldots, e_m$ selected by $m$ statistically independent
choices from $D_{X|H^*}$, the posterior probability of a hypothesis $H$ consistent with
$E$ can be found by applying the well-known Bayes rule and Eqs. 3, 4, 5 as

$$P(H|E) = P(H|e_1, e_2, \ldots, e_m) = D_H(H)\, D_X^{-1}(E)\, D_{X|H}(e_1, e_2, \ldots, e_m) = D_H(H)\, D_X^{-1}(E) \prod_{i=1}^{m} \left[ D_I^{-1}(H_d)\, D_I(in(e_i)) \prod_{j=1}^{|out|} N_{h_j(in(e_i)),\sigma_j}(out_j(e_i)) \right] \tag{6}$$

To choose the most promising hypothesis, we want to maximise $P(H|E)$ w.r.t. $H$.
We shall take logarithms of both sides of this equation (to maximise $\ln P(H|E)$)
and for this sake we disassemble the rightmost side into several terms. First, it
is argued in [12] that $D_H$ should be expected to obey

$$\ln D_H(H) = -size(H) + const_x \tag{7}$$

where $size(H)$ measures the number of bits necessary to encode the hypothesis
$H$ and $const_x$ is a normalising constant ensuring that $D_H(H)$ sums to one;
this constant is negligible when maximising $\ln P(H|E)$. Following the same
source, $\prod_{i=1}^{m} D_I^{-1}(H_d) = D_I^{-m}(H_d)$ can be identified as

$$\ln D_I^{-m}(H_d) = -m \ln gen(H) \tag{8}$$
where $gen(H)$ is the generality of $H$ (i.e. the portion of the input space covered
by $H_d$). The term $\ln\left[\prod_{i=1}^{m} D_I(in(e_i))\right] = const_1$ is constant for all
hypotheses, so it can be neglected when maximising $\ln P(H|E)$. Finally it holds

$$\ln \prod_{i=1}^{m} \prod_{j=1}^{|out|} N_{h_j(in(e_i)),\sigma_j}(out_j(e_i)) = \ln \prod_{i=1}^{m} \prod_{j=1}^{|out|} \frac{1}{\sigma_j\sqrt{2\pi}} \exp\left(-\frac{(out_j(e_i) - h_j(in(e_i)))^2}{2\sigma_j^2}\right) = -m \sum_{j=1}^{|out|} \ln(\sigma_j\sqrt{2\pi}) - \sum_{j=1}^{|out|} \frac{1}{2\sigma_j^2} \sum_{i=1}^{m} (out_j(e_i) - h_j(in(e_i)))^2 \tag{9}$$

The term $-m \sum_{j=1}^{|out|} \ln(\sigma_j\sqrt{2\pi}) = const_2$ does not depend on the hypothesis
and can be neglected when maximising $\ln P(H|E)$. Combining Eqs. 6-9 and
considering Eq. 1 we arrive at the fact that to maximise $\ln P(H|E)$ we need to
maximise the function $f_E(H)$ (w.r.t. consistent hypotheses $H$)



$$f_E(H) = -m \ln gen(H) - size(H) - \sum_{j=1}^{|out|} \frac{1}{2\sigma_j^2}\, \mathcal{E}(out_j(E), h_j(in(E))) \tag{10}$$

which can be simplified if there is only one output argument as

$$f'_E(H) = -m \ln gen(H) - size(H) - \frac{1}{2\sigma^2}\, \mathcal{E}(out(E), h(in(E))) \tag{11}$$

The first two terms in $f_E(H)$ or $f'_E(H)$ express a generality - size tradeoff derived
by Muggleton [12] for the case of learning classification hypotheses from noise-free
positive data. In our case of learning functional hypotheses from data with
normal output noise, we have instead arrived at a generality - size - Euclidean
distance tradeoff, where generality is measured on the input space (function
domain) and the output-distance term is weighted by the inverse value of the
variance $\sigma^2$. This is natural: the noisier (more deviated) the outputs
in the examples are, the more it makes sense to decide by the input domain
data (by measuring the generality on the input domain) and the prior hypothesis
probability (reflected by the size term), and vice versa.

In the following we shall concentrate on single-output hypotheses and therefore
maximise $f'_E(H)$. Thus the assumption of statistically independent outputs
is no longer needed.
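The single-output score of Eq. 11 can be illustrated on the equal/sqrt example from the beginning of this section. This is only a sketch under stated assumptions, not the Aleph implementation: the function name `f_prime`, the uniform input distribution over the 21 integers from −10 to 10, the equal size of 1 for both hypotheses, and σ = 1 are all choices made here for illustration.

```python
import math


def f_prime(examples, h, in_domain, size, sigma, inputs):
    """Score of Eq. 11: -m ln gen(H) - size(H) - E(out, h(in)) / (2 sigma^2).

    gen(H) is taken as the fraction of `inputs` inside the domain, which
    assumes a uniform input distribution (an assumption of this sketch).
    """
    # Only hypotheses consistent with all examples are scored.
    if not all(in_domain(inp) for inp, _ in examples):
        return float("-inf")
    m = len(examples)
    gen = sum(1 for i in inputs if in_domain(i)) / len(inputs)
    dist = sum((out - h(inp)) ** 2 for inp, out in examples)
    return -m * math.log(gen) - size - dist / (2 * sigma ** 2)


inputs = range(-10, 11)            # assumed sampling interval of integers
examples = [(0, 0.0), (1, 1.0)]    # e(0, 0) and e(1, 1)

score_equal = f_prime(examples, lambda x: x, lambda x: True, 1, 1.0, inputs)
score_sqrt = f_prime(examples, math.sqrt, lambda x: x >= 0, 1, 1.0, inputs)
```

Both hypotheses fit the two examples exactly, so the distance term vanishes and the scores differ only in the generality term: sqrt covers 11 of the 21 inputs and therefore scores higher than equal, in line with the informal argument above.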



3 Outlier Identification 

In a hypothesis constructed by an ILP system (an ordered set of Prolog clauses
$C^1, \ldots, C^n$), one example may be consistent with more than one clause.
Although we are learning functional hypotheses, we do not require consistency
with at most one clause, since this would constrain the learning algorithm
too much. Instead, we shall interpret the Prolog program functionally, i.e. as
once(target_predicate(inputs, OUTPUT))². Accordingly, we define the reduced
domain and reduced coverage of a clause $C^n$ as $C^n_{rd} = C^n_d \setminus \bigcup_{i<n} C^i_d$ and
$C^n_{rc} = \{e \in E \mid in(e) \in C^n_{rd}\}$.

² The standard Prolog once/1 predicate returns only the first-found answer whatever
may be the number of solutions.

To select a hypothesis by maximising $f'_E(H)$ we need to have at hand a
set of candidate hypotheses. But in a typical ILP system, hypotheses are
constructed clause-by-clause; therefore [12] proposes estimates of the values $gen(H)$
and $size(H)$ based on $|C^n|$, $gen(H^n)$, $gen(H^{n-1})$ and $size(C^n)$, where $H^n =
\{C^1, \ldots, C^n\}$ (i.e. $H^{n-1}$ is the already-constructed partial hypothesis and $C^n$ the
currently added clause) and $H$ is the final hypothesis. In an analogous spirit, if
$\frac{1}{|C^i_{rc}|}\mathcal{E}(out(C^i_{rc}), c^i(in(C^i_{rc})))$ (the average distance of the output of $C^i$ from
individual examples on its reduced domain $C^i_{rd}$) is approximately equal for all clauses
$C^i$ in the final hypothesis $H$, we may make the following estimation³

$$\mathcal{E}(out(E), h(in(E))) \approx \frac{m}{|C^n_{rc}|}\, \mathcal{E}(out(C^n_{rc}), c^n(in(C^n_{rc}))) \tag{12}$$

Let the function $f^e_E(C^n)$ denote the estimate of $f'_E(H)$ determined by substituting
the size, generality and distance terms by their estimates described in [12]⁴
and the estimate in Eq. 12, respectively. The clause $C^n$ that maximises $f^e_E(C^n)$
will then be added to the current hypothesis $H^{n-1}$.

In the clause-by-clause functional hypothesis construction, we are no longer
learning optimally (as by Eq. 11). The algorithm maximising $f^e_E(C^n)$ for each
added clause has a greedy character and we can use the following heuristic
to improve the clause ordering: If there exists a clause with good accuracy
on (low output-distance from) a large part of the example set but poor accuracy
on a few exceptions (outliers), then this general clause should be preceded
by a more special clause 'handling' these exceptions. Together with the
once-interpretation, this strategy will produce a form of a specific-to-general
decision list, whose advantage to functional representation has been argued in 

eg. 

To attain such clause-ordering, we use the ’degree of freedom’ given by the 
seed-example selection in ILP systems like Aleph [g and Progol m- In these 
systems, the seed-example is selected randomly or in the presentation order and 
used for the construction of a bottom clause which is then suitably generalised. 
The idea of our method is that we direct the seed-example selection as to first 
choose (and cover) those examples that are outliers to some potentially good 
clause. To protect efficiency, we shall avoid backtracking (deleting previously 
constructed clauses) . 

During the computation of $f^e_E(C^n)$ for each candidate clause $C^n$, we also
evaluate the function $Hope_E(C^n) = \max_{O \subset E}(f^e_{E \setminus O}(C^n))$, which yields the highest
evaluation potentially reached by $C^n$ if some example subset $O$ (outliers)
were avoided, i.e. covered by some previous clause. Evaluating $f^e_{E \setminus O}(C^n)$ for every
$O \subset E$ would be intractable, but we can avoid it by first sorting the examples
$e \in C^n_{rc}$ decreasingly by the value $(out(e) - c^n(in(e)))^2$, i.e. by their contribution

³ Recall that $c^n$ is the mapping corresponding to the hypothesis $C^n$.

⁴ Here $|C^n_{rc}|$ is taken instead of $|C^n|$, denoted as $p$ in [12].
Fig. 1. Outlier Identification in English past tense data with noise variance 0.3. The
left diagram shows the decreasing output distance contribution of each of 500 examples
w.r.t. the clause past(A, B) :- split(B, A, [e,d]). The right diagram plots for each
example $e_i$ the potential evaluation of the clause if $\{e_1, \ldots, e_i\}$ were avoided from the
clause's domain (horizontal axis: No. of Left-Out Examples). This potential evaluation
reaches its maximum for example no. 165. Examples 1-165 are thus considered outliers.



to the distance $\mathcal{E}(out(C^n_{rc}), c^n(in(C^n_{rc})))$ (see Eq. 1). Then outliers are identified by
successively moving examples in this order from $C^n_{rc}$ into the (initially empty)
set $OL$. The set $OL$ which maximises $f^e_{E \setminus OL}(C^n)$ during this cycle is taken as
the outlier set and it then holds that $f^e_{E \setminus OL}(C^n) \approx \max_{O \subset E}(f^e_{E \setminus O}(C^n))$. To
roughly see why, note that by exchanging any example $e_1$ from $OL$ with any
example $e_2$ from $C^n_{rc} \setminus OL$, obtaining $OL_{alt} = \{e_2\} \cup OL \setminus \{e_1\}$, the generality and
size estimates in the function $f^e$ maintain the same value and the distance term
remains the same or grows, as the contribution of $e_2$ to the Euclidean distance
is the same as or smaller than that of $e_1$ (due to the precomputed decreasing
order). Therefore $f^e_{E \setminus OL_{alt}}(C^n) \leq f^e_{E \setminus OL}(C^n)$. Fig. 1 shows an example of outlier
identification in the English past tense data domain (Section 4.2).
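The sorted sweep described above can be sketched as follows. This is an illustrative sketch, not the Aleph code: the function name `identify_outliers` is hypothetical, and the `score` callback is a stand-in for the clause evaluation with the outliers left out, whose exact size and generality estimates are not reproduced here. The toy score used in the example (one unit of reward per kept example minus the distance term with σ = 1) is likewise an assumption.

```python
def identify_outliers(sq_errors, score):
    """Return (outlier_indices, best_score) found by one sorted sweep.

    sq_errors[i] is the squared output error of example i under the clause.
    score(n_kept, kept_distance) is a caller-supplied evaluation of the
    clause on the examples that remain after removing the outliers.
    """
    # Examples with the largest distance contribution are left out first.
    order = sorted(range(len(sq_errors)), key=lambda i: sq_errors[i], reverse=True)
    kept_distance = sum(sq_errors)
    best_k = 0
    best_score = score(len(sq_errors), kept_distance)
    for k, idx in enumerate(order, start=1):
        kept_distance -= sq_errors[idx]
        value = score(len(sq_errors) - k, kept_distance)
        if value > best_score:
            best_k, best_score = k, value
    return order[:best_k], best_score


def toy_score(n_kept, kept_distance, sigma=1.0):
    """Hypothetical stand-in evaluation: coverage reward minus distance term."""
    return n_kept - kept_distance / (2 * sigma ** 2)


sq_errors = [9.0, 0.1, 8.0, 0.2, 0.1]
outliers, best = identify_outliers(sq_errors, toy_score)
```

Only the prefixes of the sorted order are evaluated, so the cost is dominated by the sort. With the toy score, the two examples with squared errors 9.0 and 8.0 are returned as outliers.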

In the learning algorithm, each example in $E$ is assigned a selection-preference
value, initialised to zero. For every evaluated candidate clause $C^n$ with outliers
$OL$, the current selection-preference value of each example $e \in OL \subset E$ is updated
by adding a value increasing with the $Hope$ of $C^n$. When selecting the
seed-example, the example with maximum selection-preference value is chosen⁵.
This way, examples that are outliers of high-$Hope$ clauses will be covered in the
earlier stages of hypothesis construction. Typically for the described method as
implemented in Aleph, at one stage a clause with high $Hope$ is evaluated
and rejected due to outliers, which are then forced to be covered. When the
same clause is evaluated anew (as a result of a newly selected seed example),
it is accepted since its outliers are already covered. It is the multiple evaluation
of one clause, intrinsic to the cover algorithm of Aleph, that enables us
to implement the method without backtracking. The only (slightly) superlinear
computational overhead introduced by the technique is the sorting of examples
by their contribution to the Euclidean distance.

⁵ In the first step, before generating any clause, the seed example is chosen randomly.
4 Experiments

4.1 Learning Numeric Functions 

In the first experiment, we want to identify numeric functions composed of the
four elementary functions {ln(x), sin(x), cos(x), x + y} by one Prolog clause. The
hypothesis bias is limited by the maximum composition depth 4. There are 425
functions in this hypothesis space assuming commutativity of addition. As
background knowledge, the learning system uses the Prolog definitions of the
elementary functions (e.g. ln(X,Y) :- X>0, Y is log(X)). To comply with the
framework of U-learning, we repeatedly perform the learning process with a target
hypothesis chosen with a probability exponentially decreasing with the size
of its Prolog notation. The following table lists the set of target functions used,
their input domains within the chosen sampling interval of integers (−10; 10) for
example presentation, and their prior probabilities.



Target Function          Domain ⊂ (−10; 10)                  Prior Probability⁶
ln(x)                    (1; 10)                             1/2 * c_n
ln(sin(x))               −10, (−6; −4), (1; 3), (7; 9)       1/4 * c_n
cos(x) + ln(cos(x))      (−7; −5), (−1; 1), (5; 7)           1/8 * c_n
ln(sin(x) + cos(x))      (−7; −4), (0; 2), (6; 8)            1/16 * c_n



Examples are presented in the form e(input, output), drawn from a uniform probability
distribution on the input domain, and the output value is distorted by normal
noise. We test three learning methods. BL denotes the Bayesian technique
developed in Section 2. DBL is a simplified BL, where size and generality of
hypotheses are ignored when maximising $f'_E(H)$, i.e. we ignore the information in
the input domain data distribution and in the prior hypothesis probability
distribution. We thus reason only on the basis of the output distance and so DBL
corresponds to a simple kind of distance-based learning. The last tested method
is based on a simple classical manner of treating noise in real values in ILP:
the standard Aleph (Progol) algorithm of learning from positive data is used,
but we introduce a predicate close/2 as part of the background knowledge,
such that close(A,B) is true if the values in A and B differ by less than 10%.
The learner may thus identify e.g. ln(x) from noisy-output data by the clause
e(A,B) :- log(A,C), close(C,B).
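The tolerance test behind close/2 and the clause built on it can be sketched in a few lines. The text does not say what the 10% is measured relative to; the sketch below assumes the larger of the two magnitudes, and the Python names `close` and `covers` are hypothetical counterparts of the Prolog predicates.

```python
import math


def close(a, b, tolerance=0.10):
    """True if a and b differ by less than `tolerance` of the larger magnitude.

    Measuring relative to max(|a|, |b|) is an assumption of this sketch;
    note that close(0, 0) is False under the strict inequality.
    """
    return abs(a - b) < tolerance * max(abs(a), abs(b))


def covers(a, b):
    """Counterpart of the clause e(A,B) :- log(A,C), close(C,B)."""
    return a > 0 and close(math.log(a), b)
```

An example whose output is within the tolerance of ln(input) is covered, while a more strongly distorted output is not; unlike the Bayesian score, this test gives no graded measure of how far an output lies from the prediction.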

Considering Fig. 2, BL clearly outperforms the other two methods, i.e. the
exploitation of the generality and size measures proves useful (compare with
DBL) as well as the exploitation of the Euclidean distance measure derived
from the normal noise distribution (compare with close/2). Relaxing the
U-learning conditions by presenting target hypotheses with equal probabilities makes
the difference between BL and the other methods smaller, but not significantly.

⁶ $c_n$ is a normalising constant.

Fig. 2. Learning Numeric Functions. The left diagram shows the minimum number of
examples each of the tested methods needed to correctly identify the target function
with growing variance in the output noise. For each method and each value of variance
the experiment was repeated 20 times; the average result is plotted with standard
deviation in the measurement points. The right diagram reflects a similar experiment
where, however, the prior hypothesis probabilities were not respected, i.e. the target
hypotheses were presented with equal probability.



4.2 Learning English Past Tense Rules 

The second experiment is based on 1392 tuples of English verbs and their past
tenses. Learning rules of English past tense by a multi-clause Prolog program
has been studied with noise-free data MM-. The background knowledge contains
the predicate split/3 which splits a word into a prefix and a suffix (e.g.
split([m,a,i,l,e,d], [m,a,i,l], [e,d])); see [9] for typical hypotheses
constructed by ILP in this domain. Unlike the noise-free experiments, in our case
the output argument is distorted by altering a number of characters in the word
such that the probability of $n$ wrong characters decreases exponentially with
$n^2$ to approximate the normal distribution. The following are five examples of
data with noise.

past([m,e,e,t], [m,e,t]).
past([m,i,n,i,s,t,e,r], [m,i,n,i,s,t,w,r,e,d]).
past([n,e,c,e,s,s,i,t,a,t,e], [n,e,c,q,s,s,i,y,a,t,e,d]).
past([o,b,s,e,r,v,e], [o,b,s,e,r,v,e,d]).
past([o,c,c,u,r], [o,c,c,u,r,r,e,f]).

The normal probability distribution was discretised in such a way that for $\sigma = 1$,
the majority of examples contained at least one error, i.e. in the language of
binary classification most of the presented positives were actually negatives.

We compare our method with the standard algorithm of Progol (Aleph)
whose performance is good on the noise-free past tense data nalil⁷. The integrity
constraint (see Section 2) used in m to substitute negative examples cannot

⁷ Progol was only outperformed by the method of analogical prediction whose
application scope is rather specialised.



256 Filip Zelezny 







Fig. 3. Learning past tense rules with RICs for two values of output noise variance.
The training sets are selected randomly from the past-tense database and contain
successively 5, 10, 15, 20, 50, 100, 200 and 500 examples; the testing set for measuring
the predictive accuracy is always composed of 500 examples not including any of the
training examples. For each training set volume and each tested tuning of RIC, the
experiment was repeated 20 times and the average value with its standard deviation
is plotted.



work in the noisy domain but we may use a relaxed integrity constraint (RIC)
which falsifies hypotheses giving wrong outputs for a certain minimum percentage
of examples. The question of which percentage (tolerance) should be allowed
for which level of noise (variance) is solved empirically in a preliminary comparative
experiment of RICs tuned to 0%, 10%, 20%, 30% and 40% of tolerance,
shown in Fig. 3.

We shall use the Progol (Aleph) algorithm with the best performing RIC for
each variance (tolerance 20% for $\sigma^2 = 0.5$, 30% for $\sigma^2 = 1$) to compete with our
method, which will first be simplified in the following ways. First, we require
that any resulting hypothesis must yield some output for any input word, i.e.
the generality of all acceptable hypotheses is identical. We therefore consider
the generality term in $f'_E(H)$ constant. Next, we limit the hypothesis bias by a
maximum variable depth [in and within this bias we have no reason to expect
that prior hypothesis probability decreases with the hypothesis size, i.e. the size
term is also considered constant. Since only the output distance term (measured
as squared Hamming distance⁸) is then maximised, we refer to this simplified
method as distance-based (DBL).



⁸ E.g. the distance of the hypothesis output [a,b,c] from the example outputs {[a,b,x],
[a,x]} would be $1^2 + 2^2 = 5$ because the first example differs from [a,b,c] in one
corresponding character and, to compare two lists of different lengths, we add a
suffix to the shorter with characters considered mismatches, i.e. the second example
is taken as [a,x,x]. In the normal noise distribution definition we accordingly measure
the Hamming distance instead of the subtraction $(x - \mu)$ (see Eq. 2). A distance
measure defined in this way is natural in the experimented domain and different definitions
may be suitable in other domains.
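The distance of this footnote is easy to state in code. The following sketch uses hypothetical function names; it counts mismatches at corresponding positions and treats every position by which the two lists differ in length as a further mismatch, which is equivalent to padding the shorter list with always-mismatching characters.

```python
def hamming(u, v):
    """Mismatches at shared positions plus the length difference."""
    shared = sum(1 for a, b in zip(u, v) if a != b)
    return shared + abs(len(u) - len(v))


def squared_hamming_distance(hypothesis_output, example_outputs):
    """Sum of squared Hamming distances, the analogue of Eq. 1 for words."""
    return sum(hamming(hypothesis_output, out) ** 2 for out in example_outputs)


distance = squared_hamming_distance(list("abc"), [list("abx"), list("ax")])
```

For the footnote's example the two per-example distances are 1 and 2, so `distance` is 5.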



[Figure: plots of predictive accuracy [%]; panel title: Variance 0.5 (Std. Deviation ≈ 0.7)]

Fig. 4. Learning past tense with RIC, DBL and DBL+OI. The experimental setup is
identical to the previous experiment (Fig. 3), from which the best performing RIC was
taken for comparison.
As we are learning a multi-clause hypothesis, the outlier identification (OI)
technique (Section 3) may be used. The performance of the three methods (Aleph
with RIC, DBL and DBL+OI) is shown in Fig. 4 for two levels of noise.

We observe that the DBL method alone is comparable with the best-tuned
integrity constraint. However, with RIC we need to first determine (e.g. empirically)
a good value of tolerance, otherwise the performance may be very poor
(Fig. 3). This is not necessary with DBL. Note also that to maximise $f'_E(H)$ the
DBL learner does not need to know the value of the noise variance if the size and
generality terms are considered constant. We also observe that outlier identification
greatly improves the predictive accuracy of the multi-clause hypothesis
constructed with DBL and we think this would be the case with any functional
data with a high percentage of exception items, such as English past tense. Note also
that the integrity constraint method cannot be further improved with OI, since
the OI technique is directly based on the distance measure.

5 Conclusions and Future Work 

We have illustrated how the knowledge of a particular noise
distribution in training data arguments can be utilized to outperform classical
noise-handling techniques. Using a Bayesian framework for optimal learning of
functional hypotheses in the presence of normal noise, we also exploit the knowledge
of the prior hypothesis probability and its generality on the input domain of
the learned function. The advantage of exploiting all these properties was shown
in a function-learning experiment.

We implemented the method in the ILP system Aleph and for the clause-by-clause
construction of hypotheses guided by this method we proposed a heuristic
technique which forces outliers to be covered first so that general clauses can
be accepted in the later stage of the clause-by-clause hypothesis construction.
This ordering of clauses improves the predictive accuracy of the final hypothesis
interpreted functionally, e.g. by the Prolog once/1 predicate. This was illustrated
in an experiment with a high percentage of exceptional examples. The technique
does not introduce backtracking into the learning algorithm.

Our future work will focus on proving a bound on the expected error related to the
developed Bayesian learning with noise, similar to the one shown for the noise-free
data case in [12]. Next, we want to extend the framework to non-functional
hypotheses learning from data with normal noise in arguments.